application, it is inevitable that programmes of instruction in statistics cannot 
hope to present all aspects of the subject in full detail. A judicious choice of 
material is necessary. Rather than omit specific areas of study it is preferable, 
and indeed common, to cover some topics by means of short courses designed 
to provide a brief introduction to their methods and applications. Such courses 
present the student with a general survey of the topic and provide a springboard 
for more detailed individual enquiry, or for more specialised formal study at 
a later stage. 

The collection and processing of data from finite populations is an important 
statistical topic from the practical and utilitarian standpoint, and can be a 
complex field of study in terms of statistical theory and methodology. Modern 
society abounds with published and broadcast reports of sample surveys which 
aim to describe the world we live in. In such surveys, samples are drawn from 
finite populations and are used to reflect on the population they claim to 
represent, or indeed are even extended in their claimed import to wider 
populations. 


[...] specialising in this aspect of statistics it is 
reasonable to expect his training to include a comprehensive treatment. 
Most likely this will be one of the subjects covered by the type of short ‘special 
topics’ course referred to above. 

The study of sample survey methods contains many aspects. It involves setting 
up appropriate statistical principles, and constructing suitable statistical 
methods for collecting and analysing data from finite populations. But the 
implementation of such methods is bound up with sociological, psychological 
(and other) considerations. To know that a particular method of sampling and 
estimation is statistically desirable does not necessarily mean that it can be 
easily applied. Statistical propriety is essential. But such questions as practical 
access to the population of interest, the social acceptability of an enquiry, 
personal bias in the response to questionnaires depending on the formulation 
of the questions, all impose difficult non-statistical considerations. All these 
various aspects have been widely discussed individually in detailed texts. But 
what appears to be lacking is a concise modern treatment of the subject at an 
intermediate level. This monograph is designed to form the basis of a short 
course of instruction, perhaps constituting about 15 lectures. Inevitably it 
cannot fully scan the field, and it is directed principally to the study of the 
statistical aspects, whilst keeping in mind the problems involved in their 
implementation. 

The book is based on short lecture courses given in the Universities of 
Birmingham, Western Australia and Newcastle upon Tyne, predominantly to 
senior undergraduate and postgraduate students in statistics, but also on an 
interdisciplinary basis. It discusses the principles of different methods of 
probability sampling from finite populations in relation to their relative ease 
and efficiency for estimating properties of the population of interest. Chapter 
1 considers some non-statistical aspects of survey sampling, and sets up the 
model for probability sampling. Chapter 2 is concerned with simple random 
sampling as a basis for estimating population means, totals and proportions. 
In Chapter 3 we discuss ratio and regression estimators which can exploit 
auxiliary information on additional variables in the population. Chapters 4 
and 5 consider situations where further structure exists in the population and 
simple stratification or clustering methods may be appropriate. Some more 
complicated probability sampling schemes are described briefly in Chapter 6. 

The emphasis is methodological; properties of different sampling schemes 
and estimators are discussed qualitatively as well as being formally justified. 
The treatment is at an intermediate level with mathematical proofs being 
heuristic rather than fully rigorous. A knowledge of elementary probability 
theory and statistical methods is assumed, such as would be obtained from an 
introductory course in Statistics. One unifying feature of the book is the 
empirical study of an actual simple finite population. Throughout the book 
the application and relative merits of different methods of estimation are 
demonstrated experimentally by constructing frequency distributions of esti- 
mates from this population. This augments and illustrates the theoretical 
discussion of the various techniques. 

It is hoped that, as well as providing a basis for a short course for statistics 
students, the book may also serve as an introduction to the statistical methods 
of sampling for those involved in such work at a practical level in various 
fields of application, including business administration, medicine, psychology 
and sociology. 

I am indebted to the Literary Executor of the late Sir Ronald A. Fisher, 
F.R.S., and to Dr Frank Yates, F.R.S., and to Longman Group Ltd, London, 
for permission to reprint Table 1 from their book Statistical Tables for Biological, 
Agricultural and Medical Research. 

It is a pleasure to acknowledge the help of friends and colleagues. I am 
grateful to David Brook and Betty Gittus for their useful comments on certain 
sections of the material, to Shiela Boyd for computer calculations, to Ray 
White for the two cartoons in Chapter 1, and to Shirley Daglish for her careful 
preparation of the typescript. 


Vic Barnett 
1973 


Preface to Second Edition 


The first edition of this book (under the slightly different title: Elements of 
Sampling Theory) has consistently served its intended purpose as a ‘special- 
topic’ or ‘short course’ text for the last fifteen years. Whilst its contents are as 
relevant as ever, there have inevitably been wide-ranging developments in 
sample survey principles and methods. It is essential, even in a necessarily 
selective treatment, to take such developments into account in offering a 
digestible but representative treatment of the field. 

This new edition (entitled more appropriately Sample Survey Principles and 
Methods) adopts a broadly similar overall structure to the original one, but 
expands the coverage on several fronts to reflect present emphases and some 
new or extended methods. The major innovations include 


* much greater attention to non-statistical, organisational, problems such as 
pre-survey sampling, sources of errors, obtaining the chosen sample, 
non-response, question formulation, and so on 

* change of notation to more standard form 

* extended coverage of different sampling schemes, in particular in the areas 
of ratio-estimation methods and multi-stage procedures 

* wider study of more sophisticated probabilistic sampling schemes (such as 
sampling with probability proportional to size), and inclusion of some 
additional topics 

* reinforcement of practical understanding by means of more substantial 
illustrations involving attitudes and opinions as well as objective and 
quantitative measures. 


To ease the transition to the new edition, the overall structure of chapters 
and arrangement of topics has been retained as far as possible. The new 
material and emphases have been introduced largely by modifying or expand- 
ing the chapters of the first edition. The exceptions to this are the insertion of 
a new chapter on the practical considerations of carrying out a sample survey 
after the treatment in Chapter 2 of simple random sampling, and the dispersion 
of material on non-epsem schemes (such as pps sampling) throughout the 
book as appropriate rather than gathering it together as in the original conclud- 
ing Chapter 6. 

The broader canvas does, of course, provide material for larger courses of 
study (20-30 lectures, say) with no expansion of prerequisite demand. On the 
other hand, appropriate selection of topics still enables well-balanced shorter 
courses to be covered. 

It is hoped that the updating and expansion of the contents of this new 
edition will make it attractive to student and teacher alike. 

It is a pleasure to welcome a few more cartoons by Ray White in this new 
edition. 


Vic Barnett 
1991 


Introduction 


Present-day society is preoccupied with numbers. There is a desire to express 
in quantitative terms all aspects of our lives, from cooking meals to political 
ideals. The communications media, including advertising, television, radio, 
and the newspapers, maintain a constant flow of such information. No argu- 
ment is complete without ‘the figures to back it up’. Thus we may read (or 
hear) that: 


‘house prices in the U.K. increased by 8% over the last 4 months’, 

‘200 million viewers throughout the world watched last night’s prize 
fight on television’, 

‘4 households out of 10 now have a microwave cooker’, 

‘in 1990 almost 20% of workers in Greater London spent more than 
2 hours of the working day travelling to and from work’. 


The presentation of such figures is designed to keep us informed of the situation 
in the world around us; it is often used to ‘justify’ some proposal or criticism, 
or at least to place a discussion ‘in the proper perspective’. Figures on drinking, 
or smoking, habits may be presented in support of changes in the traffic laws, 
or as a partial explanation of variations in health in different sectors of the 
community. Results of opinion polls may be advanced as predictions of the 
outcome of a forthcoming election or to illustrate the need to change laws in 
relation to changing social attitudes. 

Undoubtedly individuals are better informed than ever before, in the sense 
of being more exposed to quantitative descriptions of the world in which they 
live. This is surely a good thing, but it imposes serious demands on both the 
recipient and exponent of numerical information. On the one hand the ‘man 
or woman in the street’ needs to be able to understand and interpret the 
information that is presented. Some rudimentary knowledge of statistics is 
obviously desirable (if not common). The ease with which data can be misrep- 
resented or misinterpreted makes one sympathise with the old cry of ‘lies, 
damn lies and statistics’; the prescription caveat emptor has some justice! But 
on the other hand, there is a double responsibility on those who present 
statistical data: to do so fairly and objectively with no intent to deceive, and 
to provide sufficient detail on the source, scope, and method of collection of 
the data for proper interpretation or further analysis. 

The respective demands are not always met. In spite of an improvement in 
numeracy, the attitude to statistical data can still be one of bemusement or 
suspicion rather than of understanding or enlightenment, even though 
increasing awareness in the media and wider teaching of statistics in the schools 
are having their effect. 

Then again, the presentation of data sometimes leaves much to be desired. 
Imprecise or incomplete statements, invalid inferences, graphs with distorted 
or unspecified scales (or relating to ill-defined concepts with pseudo-scientific 
names), and pictorial diagrams with psychological impact different from their 
factual basis all confuse the recipient. Whether such devices are deliberate, or 
merely arise through lack of statistical expertise, they can certainly serve vested 
interests to advantage. But standards are improving in these matters and 
perhaps the worst excesses are behind us. The dangers are illustrated in an 
entertaining way by Huff (1973) and Moore (1985). 

The above considerations highlight the need for greater care in explaining 
statistical information, based in turn on sound knowledge and application of 
appropriate methods of collection, presentation, and interpretation of statis- 
tical data. 

The aim of this book is to provide such knowledge and skill in a particular 
context: that of data arising from sample surveys or opinion polls. The treatment 
assumes that the reader has a basic knowledge of ideas and methods of 
probability and statistics such as that presented by Huntsberger and Billingsley 
(1987) and Wetherill (1972). 


1.1 Finite populations, sample surveys 
and opinion polls 


The examples above refer to a rather special type of situation and their study 
requires a rather special expertise. What characterises them is that they all 
relate to finite populations containing a limited and clearly defined set of 
individuals or individual units: the prices of all houses sold in the U.K. over 
the last 4 months, those people throughout the world with access to television 
presentation, all people who work in Greater London, the matches in a match 
box, and so on. The population could be quite small (the 40 matches in a box) 
or very large (many million) but it is finite. 

Whether large or small, the aim is to say something about these finite 
populations by collecting and analysing information relating (in the main) to 
only a part of that population—what we call a sample from the population. 
This is obtained by surveying the population, and the study of how we should 
reasonably carry out such sample surveys is the special topic that we shall 
consider in this book. 

Such sample surveys might cover any topic relating to the characteristics of 
a finite population, but we can broadly categorise the topic areas. Most surveys 
aim to describe human populations and their environment and fall into one 
of the following groups: 


* demographic features of the population (e.g. family sizes) 
* the economic structure of the society (e.g. industrial activity) 
* patterns of life-style (e.g. travel habits) 
* the social environment (e.g. income levels) 
* views and opinions (e.g. attitudes to central government). 


Some basic distinctions should be drawn at this stage. In principle we could 
inspect every member of the finite population: strike every match in the box 
(but note the effect of so destroying the population, a point we must return 
to) or enquire of every individual in Greater London. Such a procedure is 
called a census: it is essentially a sample survey with 100% coverage! But we 
shall be concerned with much lower levels of coverage—perhaps as low as 
1% or 5% on occasions. 

Often the information we gather on individuals is quantitative and factual, 
perhaps describing some social or economic characteristics. On other occasions 
it may be qualitative or even subjective, in the form of personal views or 
preferences. In this latter case the survey is commonly known as an opinion poll. 

If we are concerned with the qualities of products or the attitudes of 
consumers then sample surveys can be vital, and survey sampling thus plays 
an important role in this field of market research. 

The principles and methods of collecting and analysing data from finite 
populations form a branch of statistics known as Sample Survey Methods; their 
formal basis is termed Sampling Theory. Whether the practitioner is an expert 
in a particular subject area (e.g. agriculture, industry, medicine) or a 
professional statistician, it is necessary to understand the basic principles and 
methods that underlie the efficient study of finite populations. 

These differ in an essential way from most statistical work. Usually it is 
assumed that data arise as independent observations from an infinite popula- 
tion according to some probability model. In survey sampling there is a fixed, 
determined, finite set of individuals to be observed. Probabilistic considerations 
enter only if we impose them on our method of sampling. Consider an example. 

We are interested in the lifetimes of light bulbs of a particular type produced 
by a specific company. If we observe bulbs coming off the production line, 
standard statistical methods would lead us to propose some probability model 
for the lifetimes, such as an exponential distribution. Successive light bulbs 
would then be assumed to have lifetimes in the form of random observations 
from this distribution. 

On the other hand we might have a carton of 100 bulbs. These bulbs have 
specific determined lifetimes, even if they are unknown. In examining this finite 
population of 100 bulbs no probabilistic consideration need be involved. But 
we might impose one, by picking a bulb at random, for example. This is quite 
different from the structure assumed in the standard life-time distribution 
approach above. Note another complication. If we pick a bulb at random from 
the finite population and test it by using it until it fails, we have then changed 
the very population we are seeking to study (see destructive testing in § 1.3 
below). 
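
The contrast between the two views can be made concrete with a small sketch. The Python fragment below is illustrative only: the mean lifetime of 1000 hours, the fixed seed, and the names `carton` and `draw_bulb` are assumptions introduced here, not values from the text. Under the model-based view each lifetime is a fresh random observation from an exponential distribution. Under the finite-population view the 100 lifetimes are fixed, and chance enters only through our random choice of bulb; testing the chosen bulb to failure removes it from the carton.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable (assumption)

# Model-based view: each lifetime is a random observation from an
# exponential distribution (a mean of 1000 hours is an assumed value).
MEAN_LIFETIME = 1000.0
model_draws = [rng.expovariate(1.0 / MEAN_LIFETIME) for _ in range(5)]

# Finite-population view: a carton of 100 bulbs whose lifetimes are fixed,
# even if unknown to us. They are generated once here purely to have some
# numbers to work with, and are then treated as fixed.
carton = [rng.expovariate(1.0 / MEAN_LIFETIME) for _ in range(100)]
true_mean = sum(carton) / len(carton)  # a fixed property of the carton


def draw_bulb(population, rng):
    """Pick one bulb at random and test it to failure.

    The tested bulb is removed, so the population itself is changed
    by the act of observation (destructive testing).
    """
    index = rng.randrange(len(population))
    return population.pop(index)


remaining = list(carton)  # work on a copy so `carton` stays intact
tested = [draw_bulb(remaining, rng) for _ in range(10)]
sample_mean = sum(tested) / len(tested)
```

After ten destructive tests only 90 bulbs remain, so any later statement about "the carton" refers to a population that is no longer the one we started with.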

The student of statistics, and the practitioner in many applied fields, needs 
to understand the special problems of finite-population sampling. 
In these pages we shall review the elements of such sampling theory, at a 
level which will be appropriate to those with some knowledge of the basic 
statistical ideas of probability distributions, estimation, and hypothesis testing. 
If the ‘expert’ knows what he is doing, we have at least the basis for an 
improved understanding in the public at large of the import of information 
obtained from sample surveys. 

Let us consider one or two examples from different subject areas, illustrating 
the range of considerations and difficulties which may arise. 


Agriculture 


(a) The level of food (for example, fruit) prices may be a cause of some public 
concern. To study the current situation it would obviously not be feasible to 
determine the prices charged at any time for, say, every item of fruit by every 
greengrocer or at every wholesale market in the geographic area of interest. 
But some indication could be obtained by selecting a few types of fruit and 
enquiring about prices on a selective basis from different greengrocers or 
markets. But how should we effect this choice; what practical difficulties will 
we encounter; what will our survey tell us of the overall situation? 

(b) A county council is required to submit an annual statement of the total 
wheat yield in its area. With great effort it might attempt a complete enumer- 
ation by contacting every farm. But there is no guarantee that it will receive 
a full response, or correct information, in each case. Just what is meant by a 
‘farm’ anyway; is this a suitable unit for enquiry? On cost considerations it 
might again make sense to sample on some appropriate basis. But the sample 
will need to be adequate to meet certain requirements of accuracy and validity— 
it must reasonably ‘represent’ the population. 


Education 


Suppose an enquiry is to be conducted into the attitudes of schoolchildren 
to the subjects they are studying. Questions of interest might include: ‘are you 
satisfied with your choice of subjects?’, ‘do you find some subjects more 
interesting than others?’, ‘what was the basis of your choice?’, ‘were you offered 
enough choice?’, and so on. Cost and convenience again suggest a sample 
survey rather than a total national enquiry and census data will not contain 
the information we need. Such an opinion poll might be carried out for all 
schoolchildren in a particular metropolitan region and we might hope that its 
results would have a wider (possibly national) relevance. Attitudes may be 
expected to vary with the child’s age, intellectual level, chosen subject combina- 
tions, and many other factors. An assortment of measures are simultaneously 
of interest. In the main, these are subjective rather than factual and will need 
to be elicited by individual enquiries in the form of questionnaires or interviews. 
The very way in which questions are asked can have a marked effect on the 
reaction of a child to the enquiry. Consider the two questions: ‘Do you like 
History?’ ‘Do you dislike History?’ perhaps with a three-point attitude scale 
(‘very much’, ‘slightly’, ‘not at all’). See Section 3.4 below for further discussion 
of this topic. 


Industry 


(a) Market research is an important tool in the design of advertising campaigns, 
in the choice of types of product to be offered for sale, and in their manner 
of presentation. Public attitudes to products are commonly sought by means 
of sample surveys (opinion polls or attitudinal enquiries). 

Suppose the topic of interest is reaction to the method of packaging of 
different brands of cigarettes. We might expect that different brands will appeal 
to different groups: to office workers, teachers, young people, and so on. It is 
important that our survey both reflects the views of the different groups and 
provides a representative coverage of the groups. Some people may resent 
being approached by interviewers in the street with such an enquiry and refuse 
to co-operate. Quota sampling is a technique commonly employed to cope 
with such problems of representativity and non-response by determining how 
many responses we want in each group and instructing interviewers to fill such 
‘quotas’. But we will see that there are problems with such an approach. 
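
The mechanics of filling quotas can be sketched as follows. This is an illustrative Python fragment only: the group labels, the quota sizes, and the stream of passers-by are invented, and `fill_quotas` is a hypothetical helper rather than any standard routine. The interviewer approaches people in whatever order they happen to arrive, skips refusals, and stops accepting members of a group once that group's quota is full.

```python
def fill_quotas(passers_by, quotas):
    """Accept respondents in arrival order until each group's quota is met.

    passers_by: iterable of (group, willing) pairs in arrival order.
    quotas: dict mapping group name to the number of responses wanted.
    Returns a dict mapping group name to the list of accepted positions.
    """
    accepted = {group: [] for group in quotas}
    for position, (group, willing) in enumerate(passers_by):
        if group not in quotas or not willing:
            continue  # outside the target groups, or refused to co-operate
        if len(accepted[group]) < quotas[group]:
            accepted[group].append(position)
        if all(len(accepted[g]) == quotas[g] for g in quotas):
            break  # every quota is full
    return accepted


# Hypothetical stream of people met in the street: (group, willing?).
stream = [
    ("office worker", True), ("teacher", False), ("young person", True),
    ("office worker", True), ("office worker", True), ("teacher", True),
    ("young person", False), ("young person", True), ("teacher", True),
]
quotas = {"office worker": 2, "teacher": 2, "young person": 2}
result = fill_quotas(stream, quotas)
```

Note that who is accepted here depends on the order of arrival and on willingness to respond, not on any random mechanism imposed by the investigator.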

(b) Monitoring the quality of manufactured products is a vast problem for 
industry. Except in the case of complex or expensive items full inspection 
cannot be justified. Investigation must inevitably be done on a sample basis, 
and reliable sampling inspection or monitoring schemes will be needed. 


Social affairs 


(a) We might wish to study the attitudes of 18 year olds in the U.K. to their 
newly acquired ‘adult status’ which includes the right to vote. A survey is to 
be conducted of this age group in a city area, relating to those whose permanent 
address is in the city. 

Consider some of the difficulties in obtaining a sample for this purpose. 
Recognising that again we will need to sample the population rather than seek 
complete information, how are we to identify the members of the population 
we wish to sample? There is unlikely to be any complete official list of 18 year 
olds. The Electoral Register will not contain all current 18 year olds, in view 
of its method of notification being tied to some particular date. (In any case, 
it would be most inefficient to scan such a list in search of a sample of a mere 
minority of its members.) To stand on a street corner and interview people 
can quite clearly lead to an imbalanced or unrepresentative sample. Non- 
response apart, we will undoubtedly miss whole sections of our population— 
those away from the city street for various reasons, either on a temporary or 
semi-permanent basis (in hospitals, prisons, on holiday or working elsewhere 
temporarily). Injudicious timing of the enquiry can also lead to imbalance. 
During the week-day some will be at work, others at school, others unemployed; 
we may obtain many of the latter by street-corner interviews! At the 
weekend, or in the evenings, other sources of imbalance arise. 

Clearly we would wish our sample to be representative of the different 
sections of the population. Direct approach to schools, hospitals, prisons, 
employment agencies, and so on, may be a more reliable and fruitful method 
of enquiry. But other difficulties now arise, principally concerned with obtain- 
ing access to the different sources. 

(b) Family income and expenditure surveys are an important sociological 
guide. Consider some of the practical difficulties that might arise in studying 
family income in some geographical region. One fundamental matter which 
generates a deal of discussion is what is meant by a ‘family’; is it the same as 
a ‘household’? How do we deal with multiple occupancy in houses, flats, or 
institutions? What items are included as ‘income’? In short, just how do we 
define our population and the measure we wish to study? Then again personal 
resistance to such enquiries, and to the method of implementing a survey, can 
easily lead to imbalanced or inaccurate results. Door-to-door interviewing during the 
working day presents an obvious example—results are likely to lead to gross 
under-assessment. Again, consider the effects of inviting voluntary response 
to an income survey (or to current political or environmental issues)! 

(c) A Civil Defence Authority may need to assess preparedness and aware- 
ness in a particular region to a sensitive issue such as a possible nuclear attack. 
Even more problems now arise. Non-response can be high because of the 
sensitivity of the issue. The complexity and multi-faceted nature of the survey 
may exacerbate this and cost-factors may rule out personal interviewing. A 
postal survey might seem on first thought to provide a cheaper approach but 
lower response rates and the need for follow-up reminders may outweigh this 
apparent advantage. Question design is all-important, to avoid ambiguities 
and allay fears. Retaining confidentiality of response presents a major problem. 
Responses may be related to political views or income levels but it may be 
inappropriate (or counter-productive) to enquire about these matters. Perhaps 
proxy questions (such as on newspaper readership) can help to provide such 
information. An initial, small, pilot survey may help to resolve some of the 
uncertainties. Would telephone enquiries be prudent? 


Medicine 


Great efforts are constantly being made to improve medical services in relation 
to organisation, care, and treatment. Information on prevailing experiences 
and attitudes of patients and administrators is vital. Sample surveys are widely 
employed. This is another area where responses are likely to be seriously 
affected by human factors—psychological attitudes to doctors or health work- 
ers (veneration or impatience), ignorance, or misunderstanding, can confuse 
issues and lead to wrong answers. In a recent enquiry about cervical cancer 
a woman was asked, ‘Is your husband circumcised?’ She replied, ‘Yes doctor, 
very.’ 


Public utilities 


Suppose a major public body, say a Water Authority, wants to estimate its 
likely expenditure over the next 10 years on repair and renewal to ensure 
maintenance of high-quality service on water supply, sewage disposal, etc. 
This will depend on the present nature and conditions of its stock of pipes, 
sewers, plant, and so on. Full inspection would be impossible: they cannot 
dig up every pipe and sewer! A sample survey, augmented by past records on 
expenditure, will be needed. But how are we even to define the units to be 
sampled (streets, local authority divisions, towns, etc.)? And for any defined 
and chosen unit, what do we measure and how do we relate the measurements 
to likely future expenditure? Imagine the cost of even a single observation: 
for examination, excavation, assessment of condition, and evaluation of invest- 
ment needs for the next 10 years. It is hard to think of a more extreme example 
of the need to carefully design the survey to ensure a minimum possible sample 
size to meet required accuracy standards in the estimation of matters of 
interest. 


The above illustrations have been carefully chosen to highlight the vast range 
of practical (essentially non-statistical) considerations and difficulties in 
designing and implementing a sample survey. They will prove useful for later 
reference in more detailed study of such practical considerations as: 


* definition of population and what to seek to measure 

* method of sampling (published data, interviewing, postal or telephone 
enquiries, etc.) 

* non-response and response bias 

* pilot surveys 

* questionnaire design and wording of questions 

* interview technique 

* cost considerations 

* use of supplementary information 

* complex surveys 


Many of these practical topics are more fully considered in their own right 
in Chapter 3 (others arise naturally at appropriate stages in our study of the 
statistical bases of survey sampling). 


1.2 Some basic concepts and definitions 


Whilst the above examples highlight the various practical difficulties which 
can arise, they also point to the need for some care in the more fundamental 
task of defining basic concepts and principles. We must start with these basic 
concepts and definitions. There is some inconsistency in the way different 
words are used. The following definitions serve our purpose in this book, but 
minor differences may sometimes be noticed in other treatments. 

Fundamental to our studies is the idea of a finite population. Individuals in 
the population (not necessarily human, they might be light bulbs, or farms) 
have certain measures of interest. For example, our concern may be for the 
life-times of the bulbs, or the annual wheat yields of the farms. We would like 
to know certain characteristics of the population with respect to some 
measure—such as the average life-time of bulbs in the carton, the total wheat 
yield in Northumbria, or the proportion of families with incomes in excess of 
£20 000 during 1988. 

Occasionally we may be able to derive the exact value of such a characteristic 
by studying every individual in our population. More often, limited time, 
money, or access dictates that we should estimate the characteristic by studying 
some smaller group of individuals in the population (a sample of its members) 
and infer the value of the characteristic from the information provided by the 
sample and by any general knowledge we have about the population. 

Let us now commence our more detailed study by identifying some of the 
common basic concepts. 


Target population 
This is the total finite population about which we require information: for 
example, all 18 year olds in the U.K. 


Study population 

This is the basic finite set of individuals we intend to study: for example, all 
18 year olds whose permanent address is in the metropolitan area of our 
enquiry; or all wheat producers in Northumbria in 1988. This may be (as in 
the latter case) the same as the target population. Alternatively (as in the 
former case) it may be a more limited, more accessible, population whose 
properties we hope can be extrapolated to the larger target population. 


Population characteristic 

This is that aspect of the population we wish to measure: for example, the 
proportion of 18 year olds who claim that they will exercise their vote at the 
next election, the total wheat yield in Northumbria in 1988. This expresses 
some aggregate feature of the population in relation to how it varies from one 
individual to another. Each individual contributes his component (a number 
or qualitative description) for some measure of interest (voting intention, or 
wheat yield). Since this can vary from one individual to another we term it 
the variable of interest. The population characteristic will usually be a total, 
mean or proportion of this variable over the population. 
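
As a hedged illustration of these definitions, the sketch below treats a small invented list of farm yields as the study population, computes the three usual population characteristics (total, mean, proportion), and then forms the corresponding figures from a random sample. The yields, the 40-tonne threshold, the seed, and the function names are all hypothetical. Scaling the sample mean by the population size is used here only as the most obvious estimate of the total; it is not offered as the formal treatment.

```python
import random

# Hypothetical variable of interest: wheat yield (tonnes) for each of
# N = 12 farms in the study population. These figures are invented.
yields = [35, 52, 41, 28, 60, 47, 33, 55, 39, 44, 31, 58]
N = len(yields)
THRESHOLD = 40  # assumed cut-off for the "proportion" characteristic


def total(values):
    return sum(values)


def mean(values):
    return sum(values) / len(values)


def proportion(values, threshold):
    """Proportion of individuals whose value exceeds the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)


# Population characteristics: fixed numbers, known exactly only if every
# individual in the population is observed.
pop_total = total(yields)
pop_mean = mean(yields)
pop_prop = proportion(yields, THRESHOLD)

# A sample of n = 4 farms drawn at random without replacement.
rng = random.Random(0)  # fixed seed for repeatability (assumption)
sample = rng.sample(yields, 4)

# Obvious sample-based estimates of the same characteristics.
est_mean = mean(sample)
est_total = N * est_mean  # scale the sample mean up to the population
est_prop = proportion(sample, THRESHOLD)
```

The population values are single fixed numbers, whereas the estimates would change if a different sample of four farms were drawn.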


Sampling units 

Ambiguities can arise concerning how we define or obtain access to individuals 
in the study population. Consider some examples. To investigate wheat yield 
in Northumbria, we could sample fields or farms in the region (however ‘field’ 
and ‘farm’ are defined), or perhaps sample larger administrative areas. Thus 
the potential members of the sample, the sampling units, can have different 
forms. They may be fields, farms, or administrative areas. A choice must be 
made at the outset of the enquiry; it can affect the usefulness of different 
sampling methods. 

As a further example, suppose we wish to conduct a survey of family 
expenditure in some city. Although the ‘individuals’ in our study population 
are ‘families’, some conventional definition of ‘family’ must be adopted before 
we can proceed. Even so, there is likely to be no easy means of identifying or 
sampling such ‘family’ units. It would be far easier to sample addresses and 
to seek information on families at the chosen addresses. So the addresses 
become the sampling units, even though the population of addresses is not of 
essential interest. 

Then again, in a survey on smoking and bronchitis in elderly people, we 
might most easily obtain information by approaching a sample of medical 
practitioners and asking about elderly people who have consulted them over 
a relevant period. The medical practitioners constitute the primary sampling 
units; their elderly patients are sub-units (which may be included in full in 
the survey, or further sampled—see Cluster Sampling in Chapter 6; also 
multi-stage sampling). Note also that we are not sampling all elderly people, 
but only those who have visited their doctors. 


Sampling frame 

Thus the source of our sample is inevitably the set of sampling units. This is 
called the sampling frame. Sometimes the sampling units may be the individual 
members of the study population. Often this is not so and the sampling frame 
is a coarser subdivision of the study population, with each sampling unit 
containing a distinct set of population members. 


List 

To use the sampling frame as the raw material from which to draw our sample, 
we must be able to identify the sampling units. Indeed the sampling frame is 
chosen with this in mind. At best an actual list of all sampling units may exist, 
such as, for example, the list of city addresses, or the list provided by medical 
records of all elderly patients visiting their doctors in a given area over a 
certain period. Such a list makes it particularly easy to choose the sample. But 
if no tangible list is available for consultation, we must at least set up a 
conceptual list. For example, in studying fruit prices at retail greengrocers, we 
may not have a list of greengrocers to peruse at leisure. Nonetheless our list 
consists of ‘all retail greengrocers’, and we must design our sampling scheme 
in such a way that it generates data from this list. 

Such distinctions are fundamental to the implementation of sample surveys. 
Let us refine and illustrate some of them by reference to the examples of 
Section 1.1. The problems they present include the following (in a roughly 
hierarchical relationship one to another): 

(i) choice of sampling units where various alternatives exist. 

(ii) discrepancy between the ideal of a target population and the reality of 
an accessible sampling frame. 

(iii) incomplete or intangible listing of sampling units. 

(iv) implementation of the sample survey. Its organisation and administration 
involve a complex array of problems of planning, costing, and 
instruction. Furthermore, we should note the following problems. 

(a) If different types of individual exist, our sample should reflect these 
in a balanced way; there may be different problems in sampling 
the different types of individual. In the survey of attitudes of 18 
year olds, those at school, in hospitals, at work, or unemployed, 
all present distinct sampling problems with regard to access, cost, 
and accuracy! 

(b) Non-response in the Civil Defence enquiry can contaminate the 
results of the survey, as can psychological attitudes or inadequate 
understanding on the part of respondents to interviews or 
questionnaires. 

The resolution of these difficulties must be sought at two levels. 

The practical (basically non-statistical) problems such as the choice of 
sampling units, administration of the survey, proper design of questionnaires, 
or adequate instruction of interviewers, and the like, require experience in a 
variety of applied disciplines. Detailed knowledge of the special features of 
the actual area of application of the survey (agriculture, medicine, social 
affairs, etc.) must be combined with advice from the psychologist on 
questionnaire design or on psychological test procedures, from the sociologist (or 
other appropriate expert) on the availability of relevant lists and records as a 
basis for the choice of sampling frame, and perhaps from the computer 
specialist on the automatic processing of the resulting data. A great deal of 
organised study, which can take us some way in avoiding the pitfalls, has gone 
into such matters. But we must ultimately depend on the native good sense 
of the organisers of a survey in the way in which they exploit the local 
circumstances and learn from their own experiences. Preliminary pilot studies 
in advance of the main survey can be a valuable aid. 

In contrast to such essentially non-statistical, practical problems, those which 
relate to such matters as the representativeness of a survey, its validity, the 
choice of appropriate sampling procedures, methods of estimation of popula- 
tion characteristics (and the properties of these estimators) and legitimate 
interpretation of the results, all depend vitally on a proper understanding and 
application of statistical ideas. A sound statistical basis in the design of a 
sample survey is vital; ‘practical’ difficulties of implementation can reduce its 
effectiveness and must therefore be resolved as far as is possible. On the other 
hand, a survey which presents no such ‘practical’ problems cannot be 
fully exploited if its statistical basis is inadequate; it is virtually useless if a 
total disregard of statistical design considerations makes it impossible to 
interpret or measure the accuracy of the results. Study of the appropriate 
statistical theory and methods may indeed be viewed as the basic theme of the 
book, and it is on this that we must now embark. 


1.3 Why sample? 


We have seen how the object of our enquiries is a population consisting of a 
finite number of individuals, on each of whom some measure is observable. 
We want to characterise the population by some aggregate expression of that 
measure—perhaps its mean, or total over the population. It is natural to ask 
‘why not observe every individual in the population and thereby obtain the 
exact answer?’ 

In some cases, where the population is small and easily accessible, this is 
obviously a sensible policy. If I want to determine how much loose change I 
have in my pocket, I am hardly inclined to take a sample of the coins and to 
try to estimate the total value! Alternatively, a full enumeration may take place 
for a very large population when there is substantial social importance to 
justify the vast expenditure. This arises, for example, in national Censuses of 
(a limited number of) facts about every member of the country’s population. 
But such cases are rare. More commonly it really does make sense, for a variety 
of reasons, to restrict our study of the population to the task of merely sampling 
some of its members and of using the information gained in this way to infer 
the characteristics of the population as a whole. What are these reasons? 


Cost 


There will be a limit on the resources, in terms of money, time or effort, that 
we can apply. This is the main obstacle to a complete enumeration of the 
population. There is the need also to counterbalance precision and expense. 
Cursory inspection of a large number of individuals (possibly even the whole 
population) may yield, in view of inaccuracies of measurement, far less precise 
information than that obtained from more careful inspection of some judi- 
ciously chosen smaller sample. The use of alternative methods of medical 
testing provides a good illustration of this effect! 

Furthermore we shall see how, even within the limitations imposed by some 
budget, different methods of sampling can yield (for the same size of sample) 
estimates of dramatically different precision. Finally we shall see how the 
additional precision which arises from increasing the sample size becomes, 
typically, less and less valuable in relative cost terms. 

Differential cost factors are also relevant. In sampling the views of 18 
year olds we may have to conduct face-to-face interviews with those in some 
group (e.g. in hospital) but write letters to those in another group (e.g. 
temporarily out of the area). The unit costs of sampling in these two different 
‘strata’ are likely to be markedly different and the very sampling design which 
we choose to employ must reflect this difference: perhaps we will have to take 
a relatively smaller sample of those in hospital than of those who are away 
from home, or we may have to sample the former group in ‘clusters’ (all those 
in particular hospitals) to control contact and travel costs. 


Utility 

In some instances our sampling units may be destroyed in the process of 
sampling. Here a complete study of the population is sterile even if we can 
afford it. There is often no point in knowing all about the population if it no 
longer exists for the exploitation of our knowledge. Thus a manufacturer of 
light bulbs, or matches, is not going to test the lifetime of each bulb, or strike 
each match, to demonstrate the quality of his product. After such destructive 
testing there would be nothing left to sell! 


Accessibility 


Frequently there is different ease of access to different sampling units as we 
have remarked above. Some may not be observable at all. Again we may 
be compelled to accept only a sample from the population. For example, 
historical records may be incomplete—temperature or rainfall readings over 
some period of interest may have been recorded sporadically; contemporary 
attitudes to some controversial issue may have been incompletely recorded 
and we cannot recreate the circumstances for fuller study. 


1.4 How should we sample? 


This is the obvious next question to ask. Its resolution will require a more 
formal specification of the finite sampling problem, and of the aims and 
objectives of a sample survey. This will occupy our attention for much of the 
remainder of the book. But we can usefully proceed a little further on a purely 
intuitive basis. The general aim must be to draw a sample which is an ‘honest 
representation’ of the population, and which leads to estimates of population 
characteristics with as great a ‘precision’ or ‘accuracy’ as we can reasonably 
expect for the cost or effort we are able to expend. 

Various pragmatic or intuitively appealing methods of sampling have been 
advanced, and are widely applied. Such ad hoc methods include the following. 


Accessibility or haphazard sampling 


With the prime stimulus of administrative convenience, a sample is chosen 
with sole concern for its ease of access. We take the most easily obtainable 
observations. The pitfalls in terms of lack of ‘representativeness’ are obvious. 
Consider the following examples. 

(i) To study the sizes of lumps of coal brought to the surface in pit trucks, 
a few lumps are removed from the top of each truck. 

(ii) In an opinion poll on colour prejudice volunteers are sought to answer 
the questions. 

(iii) In an investigation of working habits of married women a door-to-door 
enquiry is conducted in a middle-class suburban area on a weekday 
afternoon. 

(iv) Readers of a magazine are invited to complete and return a question- 
naire published in the magazine. 

The inevitable shortcomings of such samples as guides to the population as 
a whole are obvious in these examples. In other situations they may be less 
obvious, but they may be no less serious! 


Judgmental or purposive sampling 


The attitude here is quite different. Recognising that the population may well 
contain different types of individual, with differing measures and ease of access, 
the experimenter exercises deliberate subjective choice in drawing what he regards 
as a ‘representative’ sample. The results of such a sampling procedure can be 
very good, if the experimenter’s intuition or judgment is sound and it has to 
be recognised that some surveys may employ this principle to some extent. 
Judgmental sampling aims at the elimination of anticipated sources of distor- 
tion; but there will always remain the risk of distortion due to personal 
prejudices, or to lack of knowledge of certain crucial features in the structure 
of the population. This latter factor is well illustrated by the presence of 
unrecognised correlations between the criteria of choice of the sample and the 
measure being studied. This arises in the well-known example of sampling 
ash content in coal by taking some coal from the edge of each of a set of piles 
of coal in order to obtain a ‘representative’ sample of the different piles of 
coal. But the ash content varies with the size of the lumps of coal. So, whilst 
representing the different piles, the sampling procedure ignores the effect of 
size. We will tend to pick smaller lumps and the sample will be far from 
‘representative’ in this other crucial respect. 


Quota sampling 


Often judgment and accessibility are combined. For example, in quota sampling 
(see Section 5.6) people may be interviewed in the street in an attempt to 
obtain a sample judged to be well representative of different ages, sex, occupations, 
and so on. But this involves an element of accessibility: the ‘most 
promising looking passers-by’ are chosen to fill the quotas. This method is 
usually more structured than straight accessibility or judgmental sampling. 
A proper statistical design may have been used to determine what numbers 
are needed in each of the quotas, but some subjectivity and arbitrariness of 
choice of their constituent members cannot be avoided even with the most 
careful instruction of interviewers. 




Whilst these methods are progressively less prone to dangers, the major 
criticism of each of them is not that it may lead to unrepresentative samples, 
but that its results are unconvincing (and too easily dismissed if unpalatable) 
because there is no yardstick against which to measure ‘representativeness’ or to 
assess the propriety or accuracy of estimators based on such a sampling principle. 
Such a yardstick is vital! 

For this reason we are compelled to introduce an element of ‘randomness’ 
into sampling procedures and to draw our samples according to some imposed 
probability mechanism. Such an approach is essential if we are to have a 
rationale for describing the representativity and precision of our survey and 
its resulting estimators. A variety of probability sampling schemes have been 
devised, evaluated, compared, and utilised, and we shall study these in some 
detail as the only sound basis for survey sampling. 

To proceed with this we must begin to formalise our finite population model 
and our sampling objectives. 


1.5 The central concept: probability sampling 


Let us suppose that, in an effort to study some finite target population, we 
have resolved the matter of the choice of appropriate sampling units and of 
the sampling frame they comprise. Suppose that the sampling frame represents 
the accessible finite population, and that the sampling units are the individual 
members of that population. We shall refer merely to the ‘population’ and its 
‘members’ or ‘individuals’. Our interest centres on the values taken by some 
variable, Y, for the different members of the population, and on aggregate 
measures of this variable over the population. Thus if there are N members, 
we can represent the population by 

$$Y_1, Y_2, \ldots, Y_N;$$

these being the values of Y taken by the different members. 

We will be interested in population characteristics defined with reference 
to Y. Those most commonly studied are: 


(i) the population total, $Y_T = \sum_{i=1}^{N} Y_i$, 

(ii) the population mean, $\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i = Y_T/N$, 

and 

(iii) the proportion, P, of members of the population which fall into some 
category of classification for the measure Y. 
For example, in a social survey of car-driving habits in an adult 
population, P may be the proportion who drive more than 10 kilometres 
each day. 


The aim of the sample survey will be to estimate one or more of these 
characteristics from the information contained in a sample of n(<N) members 
from the population. Suppose the values of Y for the sample are 


$$y_1, y_2, \ldots, y_n$$

where each $y_i$ is one of the values $Y_j$ of Y in the population at large. Not 
all $Y_j$ are necessarily different; the same is true of the $y_i$. Strictly speaking, 
different $y_i$ might arise from the same $Y_j$ (if we are sampling ‘with replacement’ 
in the sense that a particular population member could be chosen more than 
once in the sample) but we shall assume (other than in parts of Chapters 2 
and 6) that this is not so. So, unless otherwise specifically stated, the sample 
is assumed to have been drawn ‘without replacement’—once an individual 
member of the population has been chosen it cannot be chosen again. This is 
the usual situation in survey sampling. (The parallel study of sampling with 
replacement uses traditional statistical ideas and does not involve many of 
those special considerations relating to sampling without replacement from a 
finite population. However, it will be informative later to illustrate some of 
the distinctions that arise in the two cases.) 
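
The definitions above can be tried out on a small made-up population. The following Python sketch is illustrative only: the eight Y values, the threshold of 10 used to define the category for P, and the function name `draw_without_replacement` are assumptions, not data or notation from the book. It computes $Y_T$, $\bar{Y}$ and P, and then draws a sample of distinct members.

```python
import random

# Hypothetical population of N = 8 values of Y (illustrative only).
population = [12, 7, 15, 9, 20, 11, 7, 14]
N = len(population)

# Population total, mean, and proportion above a hypothetical threshold of 10.
total = sum(population)                                  # Y_T
mean = total / N                                         # Y-bar = Y_T / N
proportion = sum(1 for y in population if y > 10) / N    # P

def draw_without_replacement(values, n, rng):
    """Return the Y values of n distinct members chosen at random."""
    indices = rng.sample(range(len(values)), n)  # an index cannot repeat
    return [values[i] for i in indices]

rng = random.Random(1)   # seeded only so that a run can be repeated
sample = draw_without_replacement(population, 3, rng)
sample_mean = sum(sample) / len(sample)
```

Note that two members share the value 7: the sample is made up of distinct members, but the recorded y values need not all differ.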

Although Y may in practice be multivariate (consider the response of an 
individual to the set of questions in a questionnaire), we shall concentrate in 
the main on situations where Y is univariate. Problems of simultaneous estima- 
tion of characteristics of the components of multivariate Y, or of associations 
between these components, will not be directly considered. But we will later 
be considering how joint observation of two associated variables may lead to 
more efficient estimation of the characteristics of one of them by exploiting 
any relationship between them, and also how to estimate ratios of characteristics 
of the components of such bivariate Y. (See Chapter 4.) 


Terminology 
The ratio of sample size to population size 


$$f = n/N$$

will be called the sampling fraction. 

To estimate $Y_T$, $\bar{Y}$, or P we will need to calculate some summary measure 
of the sample. Thus to estimate $\bar{Y}$ it might seem appealing to use the sample 
mean 

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$

But how are we to assess the propriety of such an estimator? One possibility 
is to enquire how values of $\bar{y}$ may vary in relation to $\bar{Y}$ from one occasion to 
another when we employ the current sampling procedure in the same problem. 
However, with accessibility or judgmental sampling (or even to an extent with 
quota sampling) we cannot answer such a question; the lack of any objective 
sampling principle amenable to repetition rules it out. 

Consequently we are led to introduce some probability mechanism as a 
means of drawing samples, and must consider the idea of: 


Probability sampling 


In general terms, we firstly specify the size, n, of sample to be drawn. We then 
consider (conceptually at least) all possible samples of size n that could be 
drawn from the population; $S_1, S_2, \ldots$: that is, each $S_i$ is a distinct sample of 
size n drawn from the whole population. Note that ‘distinctness’ relates to 
population membership, not necessarily to the values taken by the measure Y. 

A probability sampling scheme is defined by assigning a probability $\pi_i$ to 
each $S_i$; a particular sample, S, can then be chosen in accord with this 
probability scheme. A vast assortment of different probability sampling 
schemes is possible, corresponding to different probability distributions 
$\pi = \{\pi_1, \pi_2, \ldots\}$ over the set of possible samples, $S_1, S_2, \ldots$. These range from 
simply defined and implemented schemes to highly sophisticated ones. We 
shall be considering an assortment of the more straightforward schemes that 
are commonly used, and comparing them in terms of their cost and their 
efficiency for the estimation of $\bar{Y}$, $Y_T$, and so on. 

With such a probability-based sampling principle we are now able to discuss 
the ‘representativeness’ of the sample (in terms of its method of generation) 
and the ‘accuracy’ of estimators, employing the usual statistical concepts. 
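
As a rough illustration of this definition, the sketch below lists every possible sample of size 2 from a population of 5 labelled members, attaches a probability to each sample, and then draws one sample accordingly. The function names and both probability schemes are hypothetical choices made for the illustration; the second scheme (only samples containing member 0 can occur) is deliberately artificial.

```python
import itertools
import random

def all_samples(N, n):
    """All distinct samples of size n, as tuples of member labels 0..N-1."""
    return list(itertools.combinations(range(N), n))

def draw_sample(samples, probs, rng):
    """Choose one sample S_i with probability pi_i."""
    return rng.choices(samples, weights=probs, k=1)[0]

samples = all_samples(5, 2)                      # the possible samples S_1, S_2, ...
equal_pi = [1 / len(samples)] * len(samples)     # one scheme: all pi_i equal

# A second, purely hypothetical scheme: only samples containing member 0
# can be chosen, each of them with the same probability.
weights = [1.0 if 0 in s else 0.0 for s in samples]
restricted_pi = [w / sum(weights) for w in weights]

rng = random.Random(0)   # seeded only so that a run can be repeated
chosen = draw_sample(samples, restricted_pi, rng)
```

Both lists of probabilities sum to one over the same set of samples; what differs is which samples can arise and how often, and that is exactly what the later discussion of estimators depends on.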

Suppose that $\theta$ is some population characteristic (it may be $Y_T$) and that 
we choose to estimate it by some function, $\tilde{\theta}(S)$, of the sample. $\tilde{\theta}$ is called a 
statistic or estimator. We can discuss the properties of the sampling scheme 
and of the estimator in terms of the sampling distribution of $\tilde{\theta}$ induced by the 
probability distribution $\pi$. Different values of $\tilde{\theta}$ will be encountered on different 
occasions, with probabilities determined by $\pi$. 


Unbiasedness 


One possible criterion on which the sampling scheme may be judged 
‘representative’ is that $\tilde{\theta}$ should be unbiased. That is to say 

$$E_\pi[\tilde{\theta}(S)] = \theta,$$

where $E_\pi$ is the expectation operator. 

We shall give prime attention to unbiased estimators. Only occasionally are 
we prepared in sample survey work to ‘trade bias for precision’. 


Precision 

Often the estimator $\tilde{\theta}$ has, at least in large samples, an approximately normal 
distribution. It is reasonable, therefore, to assess the ‘accuracy’ or ‘precision’ 
of an unbiased estimator by considering its variance, 

$$\operatorname{Var}[\tilde{\theta}(S)] = E_\pi\{[\tilde{\theta}(S) - \theta]^2\};$$

the smaller this variance, the more ‘precise’ the estimator. If, for a given 
size of sample, one unbiased estimator has lower variance than another, we 
say it is more efficient. In this way we can compare estimators or different 
probability sampling schemes. 

If $\tilde{\theta}$ is biased, $\operatorname{Var}[\tilde{\theta}(S)]$ needs to be replaced by the mean square error, 

$$\text{M.S.E.}[\tilde{\theta}(S)] = E_\pi\{[\tilde{\theta}(S) - \theta]^2\};$$

the smaller this quantity, the better (or more ‘precise’) the estimator. Very 
occasionally, a biased estimator with small M.S.E. may be preferred to an 
unbiased one with larger variance, but we shall concentrate attention on 
unbiased estimators in the main. 

The broad aim of sampling theory is to devise sampling schemes which are 
economical and easy to operate, which yield unbiased estimators, and which 
minimise the effects of sampling variations. 
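
For a very small population the sampling distribution of an estimator can be written out in full, which makes the ideas of bias, variance and mean square error concrete. The sketch below assumes a hypothetical population of four values and a scheme in which every sample of size 2 is equally likely; the function `sampling_moments` and the two estimators are illustrative names, not the book's notation. Exact fractions are used so that no rounding intervenes.

```python
import itertools
from fractions import Fraction

def sampling_moments(population, n, estimator, target):
    """Exact expectation, variance and M.S.E. of `estimator` when every
    sample of size n (drawn without replacement) is equally likely."""
    samples = list(itertools.combinations(population, n))
    pi = Fraction(1, len(samples))             # the same pi_i for each sample
    values = [estimator(s) for s in samples]
    expectation = sum(pi * v for v in values)
    variance = sum(pi * (v - expectation) ** 2 for v in values)
    mse = sum(pi * (v - target) ** 2 for v in values)
    return expectation, variance, mse

population = [Fraction(y) for y in (3, 5, 8, 12)]   # hypothetical Y values
pop_mean = sum(population) / len(population)

def sample_mean(s):
    return sum(s) / len(s)

def sample_max(s):
    # A deliberately poor estimator of the mean, included to show bias.
    return max(s)

mean_stats = sampling_moments(population, 2, sample_mean, pop_mean)
max_stats = sampling_moments(population, 2, sample_max, pop_mean)
```

Under these assumptions the sample mean has expectation equal to the population mean, so its variance and M.S.E. coincide, whereas the sample maximum is biased and its M.S.E. exceeds its variance by the square of the bias.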

This latter factor is reflected by $\operatorname{Var}[\tilde{\theta}(S)]$ for an unbiased estimator. In 
general $\operatorname{Var}[\tilde{\theta}(S)]$ decreases with increase in sample size, but the cost increases. 
We must effect a balance. We shall be comparing sampling schemes to determine 
which of them yields an unbiased estimator with smallest variance for a 
given cost, or for a given sample size. We must also tackle the inverse problem 
of choosing the sample size to yield prescribed precision (in terms of 
$\{\operatorname{Var}[\tilde{\theta}(S)]\}^{-1}$). 

The ease of operation and administration of a sample survey is an important 
aspect of cost reduction. We shall sometimes find that sheer administrative 
convenience (rather than direct concern for precision) can override other 
factors in promoting a particular method of sampling. (See cluster sampling, 
systematic sampling, and quota sampling.) 


1.6 A real-life study population: a statistics class 


To illustrate the theoretical results obtained at later stages, we shall simul- 
taneously construct empirical sampling distributions from an actual finite 
population. By considering this actual population, we will be able to observe 
whether anticipated theoretical properties of sampling schemes actually do 
arise in practice. 

The population consists of all 25 members of a university class of statistics 
students taking a short course in Sampling Theory. The variables recorded 
were the weight, height, and sex of each member of the class. Heights and 
weights are denoted Y and X respectively, and recorded in cm (in excess of 
150 cm) and kg (in excess of 45 kg), respectively. Figure 1.1 presents the details 
of this population, including their seating positions in the first lecture, with 
rows labelled I-V, and columns labelled A-E. 

[Fig. 1.1. The statistics class: seating plan of the 25 members, rows I-V and columns A-E, with the lecturer's position marked.] 

Population: 25 members of a statistics class 
Variables: Sex (M or F) 
           Height (in cm, in excess of 150 cm): Y 
           Weight (in kg, in excess of 45 kg): X 
Principal variable: Y (height) 
Population characteristics: P = proportion of men = 0.6 
                            $\bar{Y}$ = mean height = 23.9 
                            $\bar{X}$ = mean weight = 20.9 

Whilst the population arises from real-life, it is not suggested that the 
procedures subsequently applied to this population, or the assumptions made 
at different stages, are realistic in practical terms. We would not apply such 
ideas to such a small population in any case. Its study is purely illustrative, 
serving to endorse the formal results obtained in later sections. 


Example 


Suppose we adopt one or other of the two intuitive sampling 
methods described above to draw a sample of size 5 from the 
Statistics Class of Figure 1.1, for the purpose of estimating the 
mean height of the class. 

No prescription can be given for drawing an accessible or judg- 
mental sample. The type of sample chosen will vary with the whim 
of the sampler. This is the disadvantage of such methods. 

However, to some, a fairly obvious accessible sample is con- 
stituted by the first row. The sample mean is 


$$\bar{y}_A = 55/5 = 11,$$

a very low value in comparison with the population mean, 23.9. 
But this is hardly surprising; all members of the first row are 
female—compared with the predominance of males in the popula- 
tion as a whole. 

In drawing a judgmental sample we consciously attempt to avoid 
such ‘unrepresentative’ samples. We might choose one member 
from each row, say, by taking the diagonal from IA to VE. This 
has 60% male members as required, and the sample mean is 


$$\bar{y}_J = 27.4,$$

which is somewhat closer to 23.9. But we have no way of assessing 
the sampling procedure in terms of values of $\bar{y}_J$ which might arise. 
Presumably, if the chosen sample has the maximum appeal to the 
sampler, he will always choose this sample. Thus he always obtains 
17.4 as his estimate. So the sampling procedure will lead to a good 
estimate if it leads to a good estimate, and vice versa! We can say 
no more regarding precision. 


1.7 Selected bibliography and references 


The treatment of sample survey theory and methods presented in this book is 
intended to be complete and self-contained in the respects outlined in the 
Preface. Nevertheless, the reader may on occasions wish to pursue slightly 
wider or more detailed study of some of the topics. The following books and 
articles might prove helpful in this respect. They are appropriately categorised 
by emphasis or content. 


General Introduction and Motivation. Kalton (1983); Stuart (1984). 
Statistical Theory and Methods. Cochran (1977); Hansen, Hurwitz and Madow 
(1953); Kish (1965); Scheaffer, Mendenhall and Ott (1986); Som (1973). 
Practical Considerations (general). Hansen, Hurwitz and Madow (1953); 
Hoinville, Jowell et al. (1978); Kalton (1983); Moser and Kalton (1971); 
Raj (1972); Yates (1981). For specific topics (e.g. telephone surveys, response 
errors) see Chapter 3. 

Accountancy. Smith (1976). 

Agriculture. Sampford (1962). 

Business and Industry. Deming (1960). 

Health. Levy and Lemeshow (1980). 

Psychology. Nunnally (1967). 

Sociology (and Politics). Hoinville, Jowell et al. (1978); Moser and Kalton 
(1971). 


(See Bibliography and References for publication details). 


1.8 Exercises 


1.1 For the illustrative examples, Social Affairs (b) and Public Utilities in 
Section 1.1, discuss the difficulties underlying the choice of an appropriate 
sampling frame, and the organisational problems likely to be encountered in 
collecting data from your preferred frames. 


1.2 In a national enquiry into Health Service needs, it is necessary to study, 
through an appropriate sample survey, the pattern of general medical practice. 
We want to examine the ‘hospital referral’ pattern for general practitioners 
over a particular year: how many patients they send to hospital for inpatient 
or outpatient treatment; how many to consultants for specialist comment, or 
to psychiatrists; how many need the services of health visitors; and so on. Two 
alternative sampling schemes are available: complete enumeration for a small 
sample of practitioners, or a sample of individual patients throughout the 
country. Discuss the possible advantages and disadvantages of the two schemes. 


1.3 We wish to examine the views of a class of undergraduate university 
students on a lecture course they have taken on Sample Survey Methods. The 
aim is to use this feedback of information to redesign the course (if necessary) 
in as far as the method of presentation of the material affects its impact and 
understanding. Construct a short questionnaire for circulation to the members 
of the class for this purpose. 


1.4 (For class co-operation). Choose an accessible and a judgmental sample 
of size 5 from the Class Example for the purpose of estimating mean weight. 
Contribute your estimates to the results for the whole class and construct 
histograms in the two cases. Discuss what appear to be the factors which have 
influenced the choice of samples by the two methods. 


1.5 Suppose it is necessary to draw a sample: 

(i) of telephone subscribers whose telephone numbers refer to a particular 
exchange, from a directory which covers many exchanges in addition 
to the one of interest, 

or 

(ii) of a particular set of authors, from a composite list of all books published 
by those authors. 

Discuss any difficulties which might arise in drawing reasonable samples in 
each case, and suggest methods of overcoming such difficulties. How would 
you try to contact the individuals in each case? 


2 


Simple random sampling 


The most basic form of probability sampling is simple random (s.r.) sampling. 
It is widely used in its own right and is easy to operate from the statistical 
viewpoint. It also serves as the basis for more complicated sampling schemes, 
such as stratified simple random sampling, and cluster sampling. The properties 
of estimators obtained from simple random samples may be readily demon- 
strated, and this will be the object of the present chapter. 


2.1 The simple random sampling procedure 


This operates in the following way. If the population is of size N, and we 
require a simple random sample of size n, this sample is chosen at random 
from the $\binom{N}{n}$ distinct possible samples, in each of which no population member 
is included more than once. That is to say, each of the $\binom{N}{n}$ samples has the 
same probability $\binom{N}{n}^{-1}$ of being chosen. Simple random sampling is an example 
of what is sometimes referred to as an epsem: an equal probability selection 
method, where each population member has the same probability of appearing 
in the sample. But more complicated sampling schemes can also be epsem’s. 
This is true, for example, in the cases, discussed later, of stratified simple 
random sampling or one-stage cluster sampling employing the probability- 
proportional-to-size (pps) principle. 

Such a sample can be obtained sequentially: by drawing members from the 
population one at a time without replacement,* so that at each stage every 
remaining member of the population has the same probability of being chosen. 

We see that this produces a simple random sample as follows. Suppose this
sequential method of choice yields $n$ (distinct) population members whose $Y$
values are

$$y_1, y_2, \ldots, y_n$$

where $y_i$ refers to the $i$th chosen member ($i = 1, 2, \ldots, n$).
The probability of obtaining this ordered sequence is

$$\frac{1}{N}\cdot\frac{1}{N-1}\cdots\frac{1}{N-n+1} = \frac{(N-n)!}{N!}.$$


* Throughout the book, the term ‘simple random sampling’ will be used to describe sampling at
random without replacement; the lack of replacement is implicit in the definition.

But any reordering of $y_1, y_2, \ldots, y_n$ corresponds to the same choice of $n$
distinct population members (that is, the same sample); there are $n!$ possible
reorderings. Thus the probability of obtaining any particular set of $n$ distinct
population members (irrespective of order) is just

$$\frac{n!\,(N-n)!}{N!} = \binom{N}{n}^{-1}.$$

There are $\binom{N}{n}$ such sets (or samples) which can arise, and these samples
are therefore generated with equal probabilities: that is, by simple random
sampling.

The choice of individual observations in the sample is achieved at each 
stage by an appropriate random mechanism applied to the remaining members 
of the population; for example, using a table of random digits such as that 
given as Table 1 in the Appendix. 
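
The argument above can be checked by brute force for a small case. The following is a minimal sketch, assuming a population labelled 0 to $N-1$ (the function name and the choice $N = 6$, $n = 3$ are illustrative, not from the text): it accumulates the exact probability of every unordered sample produced by sequential drawing without replacement.

```python
from fractions import Fraction
from itertools import permutations
from math import comb


def sample_probabilities(N, n):
    """Exact probability of each unordered sample under sequential drawing
    without replacement from a population labelled 0..N-1."""
    probs = {}
    for ordered in permutations(range(N), n):
        # At draw k (0-based) there are N - k members left, each equally likely.
        p = Fraction(1)
        for k in range(n):
            p *= Fraction(1, N - k)
        key = frozenset(ordered)
        probs[key] = probs.get(key, Fraction(0)) + p
    return probs


probs = sample_probabilities(N=6, n=3)
print(len(probs) == comb(6, 3))   # one entry per distinct sample
print(set(probs.values()))        # a single common probability
```

Every one of the $\binom{6}{3}$ samples receives the same probability, as the text asserts.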


Example 2.1 


Choose a simple random sample of 5 heights from the Statistics 
Class Example given in Figure 1.1. Numbering the population 
members 0 to 24 along the rows, starting from the top left hand 
corner, we find that the first five distinct pairs of numbers less than 
25 in Table 1 (reading along the rows) are 23, 24, 19, 09 and 06 
yielding a simple random sample 28, 41, 30, 23, 23 of heights. 

In such a simple situation this is a reasonable task. But notice 
how for larger populations and samples the effort of choosing the 
sample from a table of random digits can be quite excessive. If N 
is not 10, 100, 1000, and so on, we must either disregard a possibly 
large portion of the table, or allow multiple reference from the 
table to individual population members. We must also keep an 
account of those members which have been chosen and ignore them 
on subsequent random appearance. These factors make it desirable 
to give careful thought to the actual mechanics of using the table 
of random numbers to choose the sample, in order to keep the 
effort as low as possible. 
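
The mechanics described in Example 2.1 can be sketched in a few lines. The digit stream below is hypothetical (it is not the book's Table 1) and was made up so that the accepted labels match those quoted in the example; the function name is also an assumption.

```python
def sample_from_digits(digits, N, n):
    """Read successive two-digit numbers from a string of random digits and
    keep the first n distinct values below N (labels 0..N-1, N <= 100)."""
    chosen = []
    for pos in range(0, len(digits) - 1, 2):
        label = int(digits[pos:pos + 2])
        # Discard numbers outside the population and repeats of earlier choices.
        if label < N and label not in chosen:
            chosen.append(label)
        if len(chosen) == n:
            return chosen
    raise ValueError("digit stream exhausted before n members were chosen")


# Hypothetical digit stream: pairs 67, 23, 24, 19, 88, 09, 23, 06.
digits = "6723241988092306"
print(sample_from_digits(digits, N=25, n=5))
```

Here three of the eight pairs are wasted (two exceed 24, one is a repeat), which illustrates why the effort grows quickly when $N$ is far from a power of 10.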


We shall consider the use of simple random samples for estimating the three
population characteristics: the population mean $\bar{Y}$, the population total $Y_T$,
and the proportion, $P$, of $Y$-values in the population which satisfy some
condition of interest. We shall need to discuss how any estimators behave in
an aggregate sense: that is, in terms of their sampling distributions. By analogy
with the study of traditional estimators of parameters in infinite population
models, where the variance of the parent population is often a crucial measure,
we need to define the variance of a finite population.

Variance.
The variance of the finite population $Y_1, Y_2, \ldots, Y_N$ is defined as

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2.$$


Notice that this is a deterministic measure, not a probabilistic average as it is
in the case of random variable theory for probability models. The divisor (here
$N-1$) is arbitrary. In other treatments of sampling theory sometimes $N-1$
is used, elsewhere $N$. All we need is some convenient measure of the variability
of the $Y$ values in the population. $S^2$ is particularly convenient in that it leads
to simpler algebraic expressions in the later discussion, and produces results
more closely resembling corresponding ones in the infinite population context.

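Probability averaging only arises in relation to some prescribed probability
sampling scheme. Thus for simple random sampling we have the concept of
the expected value of $y_i$, the $i$th observation in the sample. That is

$$E[y_i] = \sum_{j=1}^{N} Y_j \Pr(y_i = Y_j) = \frac{1}{N}\sum_{j=1}^{N} Y_j = \bar{Y}.$$

The result that $\Pr(y_i = Y_j) = 1/N$ holds because the number of (ordered) samples with
$y_i = Y_j$ is $(N-1)!/(N-n)!$, and each has probability $(N-n)!/N!$.
We easily see that

$$E(y_i^2) = \frac{1}{N}\sum_{j=1}^{N} Y_j^2$$

and

$$E(y_i y_j) = \frac{2}{N(N-1)}\sum_{r<s} Y_r Y_s \quad (i \neq j).$$

Hence the variance of $y_i$, and covariance of $y_i$ and $y_j$ ($i \neq j$), are

$$\mathrm{Var}(y_i) = E[(y_i - \bar{Y})^2] = E(y_i^2) - \bar{Y}^2 = (N-1)S^2/N \tag{2.1}$$

and

$$\mathrm{Cov}(y_i, y_j) = E\{(y_i - \bar{Y})(y_j - \bar{Y})\} = E(y_i y_j) - \bar{Y}^2$$

$$= \frac{1}{N(N-1)}\left[\left(\sum_{r=1}^{N} Y_r\right)^2 - \sum_{r=1}^{N} Y_r^2 - N(N-1)\bar{Y}^2\right] = -S^2/N. \tag{2.2}$$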


Thus, as we might anticipate, there is a small negative correlation between the
potential sample observations.

The result (2.1) seems to suggest that the divisor $N$, not $(N-1)$, would be
better in the definition of the finite population variance, for the sake of
tidiness. But later results outweigh this, and show that the adopted form, with
divisor $(N-1)$, is indeed more convenient.
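
Results (2.1) and (2.2) can be confirmed exactly for a toy case by listing every ordered sample. This is only a sketch: the five population values are hypothetical, and exact rational arithmetic is used so that the comparisons are not blurred by rounding.

```python
from fractions import Fraction
from itertools import permutations

Y = [3, 7, 8, 12, 15]   # hypothetical population values
N, n = len(Y), 3

Ybar = Fraction(sum(Y), N)
S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)   # divisor N - 1, as in the text

# Every ordered sample of n distinct members is equally likely.
ordered = list(permutations(Y, n))
var_y1 = sum((s[0] - Ybar) ** 2 for s in ordered) / len(ordered)
cov_y1_y2 = sum((s[0] - Ybar) * (s[1] - Ybar) for s in ordered) / len(ordered)

print(var_y1 == (N - 1) * S2 / N)   # checks (2.1)
print(cov_y1_y2 == -S2 / N)         # checks (2.2)
```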

We can now proceed to study the estimation of the population mean.


2.2 Estimating the mean, $\bar{Y}$


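An estimator of $\bar{Y}$, based on a s.r. sample of size $n$, with immediate intuitive
appeal is the sample mean,

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

Let us consider some properties of $\bar{y}$ as an estimator of $\bar{Y}$.
We see that

$$E(\bar{y}) = \bar{Y}, \text{ so that } \bar{y} \text{ is unbiased,}$$

for,

$$E(\bar{y}) = \frac{1}{n}E(y_1 + y_2 + \cdots + y_n) = n\bar{Y}/n = \bar{Y}.$$

Also,

$$\mathrm{Var}(\bar{y}) = (1-f)S^2/n, \tag{2.3}$$

for,

$$\mathrm{Var}(\bar{y}) = \frac{1}{n^2}\left[\sum_{r=1}^{n}\mathrm{Var}(y_r) + 2\sum_{r<s}\mathrm{Cov}(y_r, y_s)\right]$$

$$= \frac{1}{n^2}\left[n(N-1)S^2/N - n(n-1)S^2/N\right]$$

$$= \left(\frac{N-n}{N}\right)S^2/n = (1-f)S^2/n.$$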


We see in (2.3) the effect of the population being finite. The sampling variance
of $\bar{y}$ is reduced by the factor $(1-f)$, where $f = n/N$ is the sampling fraction, compared with the
analogous result for an infinite population. This effect is known as the finite
population correction (f.p.c.).

If the sampling fraction is small, the f.p.c. has little importance, and we can
often ignore it. As a rule of thumb we might ignore the f.p.c. if $f$ is less than
about 0.05. The consequence is to slightly over-state the variance of the
estimator, $\bar{y}$, but such implied conservatism in any assessment of the accuracy
of estimation of $\bar{Y}$ is usually unimportant.
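
To see how little is lost by ignoring the f.p.c. at small sampling fractions, the sketch below evaluates (2.3) with and without the correction. The value of $S^2$ and the $(n, N)$ pairs are hypothetical, and `var_mean` is an illustrative helper name.

```python
from math import sqrt


def var_mean(S2, n, N=None):
    """Variance of the s.r. sample mean from (2.3); N=None ignores the f.p.c."""
    f = 0.0 if N is None else n / N
    return (1 - f) * S2 / n


S2 = 100.0   # hypothetical population variance
for n, N in [(5, 25), (50, 1000), (100, 10000)]:
    # Ratio of the standard error without the f.p.c. to the one with it.
    ratio = sqrt(var_mean(S2, n) / var_mean(S2, n, N))
    print(f"f = {n / N:.3f}: standard error overstated by a factor of {ratio:.4f}")
```

At $f = 0.05$ the standard error is overstated by under 3 per cent, which is the sense in which the rule of thumb is harmless.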


Terminology.
The standard deviation of $\bar{y}$, $[\mathrm{Var}(\bar{y})]^{1/2}$, will be called its standard error.


So we can say of $\bar{y}$ that it is unbiased as an estimator of $\bar{Y}$, and (2.3) enables
us to compare it on efficiency grounds with other estimators of $\bar{Y}$ based on
s.r. samples, or samples obtained from other sampling schemes.

Also, $\bar{y}$ is consistent in the finite population sense: as $n \to N$, so $\bar{y}$ becomes
$\bar{Y}$. Under appropriate circumstances we can also invoke the approximate
normal distributional form for $\bar{y}$ to derive confidence intervals for $\bar{Y}$, or to
choose a sample size $n$ to meet prescribed accuracy requirements. But before
considering these matters let us pose a more basic question.

Within the simple random sampling scheme, how well does $\bar{y}$ compare with
other possible estimators of $\bar{Y}$?

One property is easily demonstrated. The sample mean, $\bar{y}$, is the best linear
unbiased estimator of $\bar{Y}$ based on a simple random sample of size $n$.

The expression ‘best’ is used here to mean ‘having smallest variance’. The
result does not imply any global optimality among all estimators, $\theta(y)$, of $\bar{Y}$,
but it is useful to know that in the class of easily calculated linear unbiased
estimators, the simple form $\bar{y}$ is best.

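Let us confirm that this is so. Consider any linear unbiased estimator,

$$t = \sum_{i=1}^{n} a_i y_i,$$

with

$$\sum_{i=1}^{n} a_i = 1$$

to ensure unbiasedness. From (2.1) and (2.2),

$$\mathrm{Var}(t) = \left(\frac{N-1}{N}\right)S^2\sum_{i=1}^{n} a_i^2 - \frac{2S^2}{N}\sum_{i<j} a_i a_j = S^2\left(\sum_{i=1}^{n} a_i^2 - \frac{1}{N}\right),$$

since $2\sum_{i<j} a_i a_j = 1 - \sum_{i=1}^{n} a_i^2$ when the $a_i$ sum to 1. Thus we need

$$\sum_{i=1}^{n} a_i^2 = \sum_{i=1}^{n}(a_i - \bar{a})^2 + n\bar{a}^2 \quad \left(\bar{a} = \frac{1}{n}\sum_{i=1}^{n} a_i = \frac{1}{n}\right)$$

to be a minimum, which will be so if

$$\sum_{i=1}^{n}(a_i - \bar{a})^2 = 0.$$

In other words $a_i$ must be constant ($i = 1, 2, \ldots, n$), and the unbiasedness
condition shows that, for minimum variance,

$$a_i = 1/n,$$

yielding the sample mean, $\bar{y}$, as the best linear unbiased estimator.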
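
As a numerical illustration of this result, the sketch below computes the exact variance of $t = \sum a_i y_i$ over all ordered s.r. samples from a small hypothetical population, for equal weights and for one arbitrary set of unequal weights. It also compares each value with the closed form $S^2(\sum a_i^2 - 1/N)$, which follows from (2.1) and (2.2) when the weights sum to 1.

```python
from fractions import Fraction
from itertools import permutations

Y = [3, 7, 8, 12, 15]   # hypothetical population values
N = len(Y)
Ybar = Fraction(sum(Y), N)
S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)


def var_linear(a):
    """Exact variance of t = sum(a_i * y_i) for weights a that sum to 1."""
    ordered = list(permutations(Y, len(a)))   # equally likely ordered samples
    values = [sum(ai * yi for ai, yi in zip(a, s)) for s in ordered]
    # E(t) equals Ybar because the weights sum to 1.
    return sum((t - Ybar) ** 2 for t in values) / len(ordered)


equal = [Fraction(1, 3)] * 3
unequal = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
for a in (equal, unequal):
    v = var_linear(a)
    closed_form = S2 * (sum(ai ** 2 for ai in a) - Fraction(1, N))
    print(float(v), v == closed_form)
```

The equal-weight estimator (the sample mean) has the smaller variance, as the argument requires.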


2.3 Random sampling with replacement


Whilst we have noted that sampling with replacement from a finite population
does not feature widely in sample survey work, it is nevertheless interesting
to note how the results would differ if we chose a sample of size $n$ by picking
each sample member at random but with replacement, from the overall
population of size $N$.

Now, each observation has probability $N^{-1}$ of being any of the population
members (rather than successive observations having probabilities $N^{-1}$,
$(N-1)^{-1}$, $(N-2)^{-1}, \ldots$ of being any of the non-observed population members).

Effectively, we have reverted to the conventional statistical method where
we are drawing a random sample of size $n$ from a discrete uniform distribution
on the set of values $(Y_1, Y_2, \ldots, Y_N)$.

So the sample mean, $\bar{y}$, is unbiased for the mean of the distribution, which
is again $\bar{Y} = (\sum_{i=1}^{N} Y_i)/N$. Furthermore, it has sampling variance $\sigma^2/n$, where
$\sigma^2$ is the variance of the uniform distribution. But

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = (N-1)S^2/N.$$

So

$$\mathrm{Var}(\bar{y}) = \left(1 - \frac{1}{N}\right)S^2/n$$

compared with $[1 - (n/N)]S^2/n$ in the case of s.r.s. (without replacement).
Thus sampling with replacement is bound to be less efficient: the relative
efficiency is $(N-n)/(N-1)$. See also Section 2.8 below.
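
The comparison can be made concrete by enumerating both schemes for a small hypothetical population. In this sketch `itertools.product` lists the equally likely ordered samples drawn with replacement, and `itertools.combinations` lists the equally likely samples drawn without replacement.

```python
from fractions import Fraction
from itertools import combinations, product

Y = [3, 7, 8, 12, 15]   # hypothetical population values
N, n = len(Y), 2
Ybar = Fraction(sum(Y), N)
S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)


def exact_var_of_mean(samples):
    """Variance of the sample mean over a list of equally likely samples."""
    means = [Fraction(sum(s), n) for s in samples]
    return sum((m - Ybar) ** 2 for m in means) / len(means)


with_repl = exact_var_of_mean(list(product(Y, repeat=n)))
without_repl = exact_var_of_mean(list(combinations(Y, n)))

print(with_repl == (1 - Fraction(1, N)) * S2 / n)      # variance with replacement
print(without_repl == (1 - Fraction(n, N)) * S2 / n)   # (2.3)
print(without_repl / with_repl, Fraction(N - n, N - 1))  # relative efficiency
```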

The expression (2.3) for $\mathrm{Var}(\bar{y})$ is used in three ways:
(i) to assess the precision of the estimator $\bar{y}$,
(ii) to compare $\bar{y}$ with other estimators of $\bar{Y}$,
(iii) to determine the size of sample needed to yield a desired precision.
Typically, however, we will not know the value of $S^2$, so that to make use
of the sampling variance (2.3) we must estimate $S^2$ from sample data. Using
the simple random sample $y_1, y_2, \ldots, y_n$ we might try (cf. infinite populations)
using

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$

It turns out that

$$E(s^2) = S^2,$$

for,

$$E(s^2) = \frac{n}{n-1}\left[\sum_{j=1}^{N} Y_j^2/N - E(\bar{y}^2)\right]$$

$$= \frac{n}{n-1}\left[\sum_{j=1}^{N} Y_j^2/N - (1-f)S^2/n - \bar{Y}^2\right]$$

$$= \frac{n}{n-1}\left[\frac{N-1}{N} - \frac{1-f}{n}\right]S^2$$

$$= S^2.$$

So that $s^2$ is unbiased for $S^2$.

In relation to problems (i) and (ii), above, we can substitute for the unknown
population variance in (2.3) the unbiased sample estimator, $s^2$, and we have
an unbiased estimator of $\mathrm{Var}(\bar{y})$ as

$$s^2(\bar{y}) = (1-f)s^2/n.$$

Occasions may also arise where estimation of $S^2$ is of interest in its own
right; again $s^2$ serves for this purpose. But the problem (iii), where we wish
to determine the size of sample needed to achieve a desired precision, is less
straightforward if $S^2$ is not known. This is because the sample estimator $s^2$ is
now of no relevance since we do not have a sample from which to extract it!
We need to determine the required sample size prior to sampling, and we shall
consider later in Section 2.6 how to face up to the difficulty of an unknown
population variance.
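
The unbiasedness of $s^2$, and hence of $(1-f)s^2/n$ as an estimator of the variance in (2.3), can be checked by averaging over every possible s.r. sample. A minimal sketch with a hypothetical population of five values:

```python
from fractions import Fraction
from itertools import combinations

Y = [3, 7, 8, 12, 15]   # hypothetical population values
N, n = len(Y), 3
f = Fraction(n, N)


def variance_div_m1(values):
    """Sum of squared deviations from the mean, divided by (count - 1)."""
    m = Fraction(sum(values), len(values))
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


S2 = variance_div_m1(Y)                 # population variance, divisor N - 1
samples = list(combinations(Y, n))      # all equally likely s.r. samples

E_s2 = sum(variance_div_m1(s) for s in samples) / len(samples)
E_var_hat = sum((1 - f) * variance_div_m1(s) / n for s in samples) / len(samples)

print(E_s2 == S2)                       # s^2 is unbiased for S^2
print(E_var_hat == (1 - f) * S2 / n)    # (1 - f) s^2 / n is unbiased for (2.3)
```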


2.5 Confidence intervals for $\bar{Y}$


Example 2.2


Suppose that for the Statistics Class described in Section 1.6 we
decide to estimate the mean height $\bar{Y}$ from the sample mean of a
simple random sample of size 5. Thus if our sample is 28, 41, 30,
23, 15 the estimate of $\bar{Y}$ is

$$\bar{y} = 27.40.$$

From (2.3) we see that the sampling variance of the sample mean
is

$$(1 - 5/25)S^2/5$$

where the population variance for the $Y$-values is in fact 80.44.
Thus the standard error of $\bar{y}$ is $0.4S = 3.59$. So the estimate 27.40
turns out to be about one standard error in excess of $\bar{Y}$, which we
know to be 23.88.

But what of the actual distribution of $\bar{y}$? We can obtain some
indication of its form from taking repeated s.r. samples of size 5.
Five hundred such samples generated on a computer yielded values
of $\bar{y}$ represented by the histogram* shown in Figure 2.1. The sample
variance of the 500 sample means is 12.52, which compares well
with the expected value 12.87. (The average of the 500 sample
means is 23.89.)
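
The arithmetic of Example 2.2 is easily reproduced. The sketch below uses only the figures quoted in the example (the sample, $N = 25$, $S^2 = 80.44$ and $\bar{Y} = 23.88$); the population itself is not needed.

```python
from math import sqrt

N, n = 25, 5
S2 = 80.44                       # population variance quoted in the example
Ybar = 23.88                     # population mean quoted in the example
sample = [28, 41, 30, 23, 15]

ybar = sum(sample) / n
var_ybar = (1 - n / N) * S2 / n  # formula (2.3)
se = sqrt(var_ybar)              # standard error of the sample mean

print(round(ybar, 2))                 # estimate of the mean
print(round(var_ybar, 2))             # expected variance of sample means
print(round(se, 2))                   # standard error
print(round((ybar - Ybar) / se, 2))   # error in standard-error units
```

The second printed value is the 12.87 against which the variance of the 500 simulated means is compared.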


We see in Figure 2.1 that, even for such a small sample size and non-negligible
sampling fraction, the sampling distribution of $\bar{y}$ shows no substantial lack of
symmetry. But this effect is assisted by the fact that the finite population of
$Y$-values itself shows little ‘skewness’. This will not always be true of course.
We will frequently encounter skew populations in practice, often positively
skew in the sense of exhibiting a long tail of large $Y$-values. Consider, for
example, the numbers of children in different families, or family annual
incomes. Even more extreme situations will be encountered, with a maximum
frequency at the lowest $Y$-values so that we have an almost J-shaped
distribution of population values. Consider, for example, the numbers of claims on
an insurance policy arising from a population of insured people, or the number
of times a medical practitioner sees each of the patients on his panel over a
given year.


Fig. 2.1. Histogram of 500 values of $\bar{y}$ from s.r. samples of size 5 in the Statistics Class Example (horizontal axis: $\bar{y}$, marked from 10 to 35; vertical axis: frequency).


But, supported by a finite population analogue of the Central Limit Theorem,
it can often be assumed that the s.r. sample mean has approximately a normal
sampling distribution. That is,

$$\bar{y} \sim N(\bar{Y}, (1-f)S^2/n). \tag{2.4}$$


* For this histogram and for the others throughout the book, the data have been grouped in
intervals of width 1 cm. The ordinate label ‘frequency’ relates throughout to the numbers of
observations in such equal-sized intervals.

This assumption is often very reasonable even in the presence of skewness
in the population. As a rough guide for positive skew populations, we require
the sample size $n$ to satisfy

$$n > 25G_1^2$$

where,

$$G_1 = \sum_{i=1}^{N}(Y_i - \bar{Y})^3 / (NS^3)$$

(the finite-population analogue of Fisher's coefficient of skewness). In addition,
the sampling fraction, $f$, should not be too large. Much discussion of the
propriety of the normal approximation (2.4) has appeared in the literature.
Cochran (1977, Chapter 2) gives further details and relevant references.
Where appropriate we can use the normal distribution to make further
inferences about $\bar{Y}$. We might wish to construct a confidence interval for $\bar{Y}$.
An approximate $100(1-\alpha)\%$ symmetric two-sided confidence interval for $\bar{Y}$ can
be written

$$\bar{y} - z_\alpha S\sqrt{(1-f)/n} < \bar{Y} < \bar{y} + z_\alpha S\sqrt{(1-f)/n}, \tag{2.5}$$

where $z_\alpha$ is the double-tailed $\alpha$-point of $N(0, 1)$. That is, if
$Z \sim N(0, 1)$, $\Pr(|Z| > z_\alpha) = \alpha$.

But $S^2$ will not be known in practice. Replacing $S^2$ by the sample estimate,
$s^2$, will be reasonable, provided $n$ is sufficiently large. By analogy with the
infinite population case, a better allowance can be made for not knowing $S$
by using Student’s $t$-distribution rather than the normal distribution, when $n$
is small (less than about 40).

We then have an approximate $100(1-\alpha)\%$ symmetric two-sided confidence
interval for $\bar{Y}$ in the form

$$\bar{y} - t_{n-1}(\alpha)\,s\sqrt{(1-f)/n} < \bar{Y} < \bar{y} + t_{n-1}(\alpha)\,s\sqrt{(1-f)/n} \tag{2.6}$$

where $t_{n-1}(\alpha)$ is the double-tailed $\alpha$-point of $t$ with $(n-1)$ degrees of freedom.
Values of $z_\alpha$, and of $t_{n-1}(\alpha)$ for a range of values of $n$, are given in Table 2 of
the Appendix, for $\alpha = 0.1$, 0.05, 0.02, 0.01, 0.002 and 0.001. If $n$ is more than
about 40, the normal distribution percentage points will usually be reasonable.

Since sample surveys commonly relate to very large populations (say $N = 10\,000$
or more) with substantial sample sizes (say $n = 100$ or more), we will
frequently be safe in adopting the form (2.5), replacing $S$ by $s$, without regard
to the fine details of justifying the normal distributional form for $\bar{y}$ or serious
concern for the sampling fluctuations of $s$ as an estimate of $S$.

However, one word of warning is needed on this latter issue. The sampling
variance of $s^2$ is highly sensitive to the value of the fourth central moment of
the $Y$-values in the finite population. Typically, the larger this fourth moment
the larger is $\mathrm{Var}(s^2)$. This is particularly serious when we attempt to compare
the precision of alternative estimators and need to estimate $S^2$. It can be even
more serious when we need to choose a sample size to yield a required precision
(as described in the next section).


Example 2.3 


In a particular sector of industry a survey was conducted in an
attempt to investigate the extent of absenteeism not connected with
illness or official holidays. A random sample of 1000 men out of
a total workforce of 36 000 were asked how many days they had
taken off work, in the previous six months, as ‘casual holidays’.
The results were as follows.


Days off       0    1    2    3    4    5    6    7    8    9
No. of men   451  162  187  112   49   21    5   11    2    0


To estimate the average number, $\bar{Y}$, of days ‘casual holiday’ taken
by workmen in the industry we can use the sample mean

$$\bar{y} = 1.296.$$

The sample variance is

$$s^2 = 2.397.$$

Using the normal approximation to the distribution of the sample
mean we can obtain an approximate 95% symmetric two-sided
confidence interval for $\bar{Y}$ as

$$1.201 < \bar{Y} < 1.391$$

[or, $1.200 < \bar{Y} < 1.392$ (ignoring the f.p.c.)].
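
The figures in Example 2.3 can be recomputed from the frequency table. The sketch below takes the nine printed counts to belong to 0 to 8 days off (the alignment that reproduces the quoted mean and variance, and under which the counts total 1000) and applies (2.5) with $S$ replaced by $s$. The normal percentage point comes from Python's `statistics.NormalDist`; the helper name `interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

days = range(9)                                     # 0, 1, ..., 8 days off
men = [451, 162, 187, 112, 49, 21, 5, 11, 2]        # observed frequencies
N = 36000
n = sum(men)

ybar = sum(d * m for d, m in zip(days, men)) / n
s2 = sum(m * (d - ybar) ** 2 for d, m in zip(days, men)) / (n - 1)
z = NormalDist().inv_cdf(0.975)                     # double-tailed 5% point


def interval(f):
    """Approximate 95% interval (2.5) with S replaced by s, for sampling fraction f."""
    half_width = z * sqrt((1 - f) * s2 / n)
    return round(ybar - half_width, 3), round(ybar + half_width, 3)


print(n, round(ybar, 3), round(s2, 3))
print(interval(n / N))   # with the f.p.c.
print(interval(0.0))     # ignoring the f.p.c.
```

With $f = 1/36$ the two intervals differ only in the third decimal place, consistent with the rule of thumb for ignoring the f.p.c.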


Note some characteristic features of this problem. The distribution of values 
in the population is obviously highly skew. This will affect the propriety of 
the normal approximation in general, although the current sample size of 1000 
is an adequate safeguard. 

Then there is the inevitable problem of assessing the accuracy of the
information given on such a controversial issue. (Promises of confidentiality may not
necessarily allay all concern!) If the survey involved compulsory response,
fears for its accuracy become more acute. Suppose that a number of those
questioned just refused to answer. How would we deal with such
‘non-response’: might it yield an unrepresentative sample? How would we then
process the data to estimate $\bar{Y}$?


2.6 Choice of sample size, n


Clearly an increase in sample size will lead to an increase in the precision of
$\bar{y}$ as an estimator of $\bar{Y}$, but the sampling costs will also typically increase
and there is likely to be some limit on what we can afford. Too large a sample
will imply a waste of resources; too small a sample is likely to produce an
estimator of inadequate precision. Ideally we should state the precision we
require, or the maximum cost which can be expended, and choose the sample
size accordingly.

Such an aim involves a complex array of considerations. What is the cost
structure for sampling in a given situation; how do we assess the precision we
require of our estimators; how do we balance needs in relation to different
population characteristics which may be of interest; how do we deal with lack
of knowledge of unknown parameters (e.g. the population variance) which
may affect the precision of estimators?

We will consider only one simple situation. We assume that the object is to
estimate a single characteristic, the population mean $\bar{Y}$, by using a s.r. sample
mean $\bar{y}$, restricting to an acceptable level the probability that the absolute difference
between $\bar{Y}$ and $\bar{y}$ is greater than some specified value. No direct consideration
of costs arises, although, if it happens that sampling costs are directly
proportional to sample size, it turns out that we achieve our aim for minimum cost.

Thus suppose we seek the minimum value of $n$ that ensures that

$$\Pr(|\bar{Y} - \bar{y}| > d) \le \alpha \tag{2.7}$$

for some prescribed $d$ and (small) $\alpha$. The sampler needs to specify the tolerance
$d$, and the risk $\alpha$ of not obtaining such tolerance.
We can rewrite (2.7) as

$$\Pr\left(\frac{|\bar{Y} - \bar{y}|}{S\sqrt{(1-f)/n}} > \frac{d}{S\sqrt{(1-f)/n}}\right) \le \alpha$$

so that, using the normal approximation (2.4), we require

$$\frac{d}{S\sqrt{(1-f)/n}} \ge z_\alpha \tag{2.8}$$

or,

$$n \ge N\left[1 + N\left(\frac{d}{z_\alpha S}\right)^2\right]^{-1}. \tag{2.9}$$

Equivalently, (2.8) declares that

$$\mathrm{Var}(\bar{y}) \le (d/z_\alpha)^2 = V \text{ (say)}.$$

The inequality (2.9) can be written

$$n \ge \frac{S^2}{V}\left[1 + \frac{1}{N}\,\frac{S^2}{V}\right]^{-1}$$

so that as a first approximation to the required sample size, we could take

$$n_0 = S^2/V. \tag{2.10}$$

This is an overassessment, but it will be reasonable unless the provisional
sampling fraction, $n_0/N$, is substantial. If this is so, we would need to reduce
our assessment of the required sample size to

$$n_0(1 + n_0/N)^{-1}.$$
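
The two-step calculation, (2.10) followed by the adjustment for a non-negligible sampling fraction, can be sketched as follows. The inputs ($S^2 = 2.4$, $d = 0.1$, $\alpha = 0.05$, $N = 36\,000$) are hypothetical, and `required_sample_size` is an illustrative name; the normal percentage point again comes from `statistics.NormalDist`.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(S2, d, alpha, N):
    """Return (n0, n): the first approximation (2.10) and the f.p.c.-adjusted size."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # double-tailed alpha-point
    V = (d / z) ** 2                          # largest acceptable Var(ybar)
    n0 = S2 / V
    n = n0 / (1 + n0 / N)
    return n0, n


# Hypothetical inputs: S^2 assumed known, tolerance d, risk alpha.
S2, d, alpha, N = 2.4, 0.1, 0.05, 36000
n0, n = required_sample_size(S2, d, alpha, N)
print(ceil(n0), ceil(n))

# The rounded-up n satisfies the variance requirement; one fewer does not.
V = (d / NormalDist().inv_cdf(1 - alpha / 2)) ** 2
print((1 - ceil(n) / N) * S2 / ceil(n) <= V)
print((1 - (ceil(n) - 1) / N) * S2 / (ceil(n) - 1) <= V)
```

Rounding up is a design choice here: the inequality (2.9) asks for the smallest integer $n$ at or above the bound.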


This presupposes, however, that $S^2$ is known. Such a possibility is remote in
practice and we must face the more difficult task of estimating the minimum
sample size required to satisfy (2.7) when $S^2$ is unknown. There are basically
four ways in which we might try to do this.


(i) From pilot studies


Often a pilot study may be conducted prior to a major sample survey (see
Section 3.2). This can serve a variety of purposes including the study of different
sampling frames and the examination of any implicit practical difficulties
which may be encountered in the sampling procedure.

If such a pilot study itself takes the form of a simple random sample, its
results may give some indication of the value of $S^2$ for use in the choice of
the sample size of the main survey; if the pilot sample is not obtained by a
probability sampling procedure we must be circumspect in such an application
of the results. For convenience, a pilot study is often restricted to some limited
part of the population. If so, the estimate of $S^2$ which it yields can be quite
biased.


(ii) From previous surveys


It is not uncommon to find that other surveys have been conducted elsewhere
which have studied similar characteristics in similar populations. This is
particularly common in educational, medical, or sociological investigations: it
may just be that a different age-group of pupils, or results for a different year,
or effects in a different city or social structure, are being considered. Often
the measure of variability from earlier surveys can be used to estimate $S^2$ for
the present population, in order to choose the required sample size to meet
any prescription of precision in the current work. But again precautions must
be taken in extrapolating from one situation to another.


(iii) From a preliminary sample


This is the most reliable approach, but it may not be feasible on administrative
or cost considerations. It operates as follows. A preliminary s.r. sample of size
$n_1$ is taken and used to estimate $S^2$ by means of the sample variance $s_1^2$. We
aim to ensure that $n_1$ is inadequate to achieve the required precision, and then
to augment the sample with a further s.r. sample of size $(n - n_1)$, where $(n - n_1)$
is chosen by using $s_1^2$ as the necessary preliminary estimate of $S^2$.

A detailed study of this procedure shows that, under reasonable conditions,
the total sample size, ignoring the f.p.c., needs to be

$$(1 + 2/n_1)\,s_1^2/V$$

an essential increase by the factor $(1 + 2/n_1)$ over what would be needed if $S^2$
were known.

This approach, if feasible, is undoubtedly the most objective and reliable.
The sampling procedure is an example of what is called double (or two-phase)
sampling.


(iv) From practical considerations of the structure
of the population


Occasionally we will have some knowledge of the structure of the population
which throws light on the value of $S^2$. Suppose we are considering

(a) the numbers of misprints in books (of roughly the same size, or over a
prescribed number of pages) issued by a particular publisher over a
certain period of time, or

(b) the number of faults that occur in video-recorders of a particular type
in the first year of their use.

In both cases there is reason to believe that the $Y$-values might vary roughly
in the manner of a Poisson distribution, so that it is plausible to assume that
$S^2$ is of the same order of magnitude as $\bar{Y}$. Any information we have about
the possible value of $\bar{Y}$ (for example, from other similar studies) can then be
used to approximate $S^2$ and assist in the choice of the required sample size.
Furthermore, if we can assume that $S^2 = \bar{Y}$, then we can obtain an approximate
$100(1-\alpha)\%$ symmetric two-sided confidence interval for $\bar{Y}$ directly, without
the need for an estimate of variability. Using the normal approximation to the
Poisson distribution we have

$$\Pr\left[|\bar{Y} - \bar{y}| < z_\alpha\sqrt{\bar{Y}(1-f)/n}\right] = 1 - \alpha.$$

Thus, the confidence interval is obtained in the form

$$\bar{Y}^2 - [2\bar{y} + z_\alpha^2(1-f)/n]\bar{Y} + \bar{y}^2 < 0$$

i.e. as the region between the two roots of the equation

$$\bar{Y}^2 - [2\bar{y} + z_\alpha^2(1-f)/n]\bar{Y} + \bar{y}^2 = 0.$$
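
The interval defined by the two roots can be computed directly. This is a minimal sketch, assuming $S^2 = \bar{Y}$ as in the text; the inputs ($\bar{y} = 2.0$, $n = 100$, $N = 10\,000$) are hypothetical and `poisson_type_interval` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def poisson_type_interval(ybar, n, N, alpha=0.05):
    """Roots of Y^2 - [2*ybar + z^2 (1 - f)/n] Y + ybar^2 = 0, assuming S^2 = Ybar."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # double-tailed alpha-point
    c = z * z * (1 - n / N) / n
    b = 2 * ybar + c
    disc = sqrt(b * b - 4 * ybar * ybar)      # always real, since c >= 0
    return (b - disc) / 2, (b + disc) / 2


lower, upper = poisson_type_interval(ybar=2.0, n=100, N=10000)
print(round(lower, 3), round(upper, 3))
```

The interval is not symmetric about $\bar{y}$: the upper limit lies further away because the assumed variance grows with $\bar{Y}$.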


Then again, if we are interested in estimating a proportion P we shall see 
that the sampling variance of the s.r. sample estimator is simply related to P. 
Reasonable bounds can be placed on this variance to obtain some idea of the 
required sample size. We shall return to this point later (Section 2.10). 


2.7 Systematic sampling 


Suppose we wish to draw a sample of size n from a population of size N and 
have available a complete list of the population members. The list is of great
value in specifying the sample we intend to draw. Members of the sample can
be identified on the list, and then sought in the population. Thus we might 
sample the students in a university by using a published list of students for 
that university; or the books in a library by going through the catalogue of 
books held by the library (multiple entries in such indexes raise interesting 
sampling problems!). 

Even with such a complete list, however, the choice of a random sample 
can be tedious and time-consuming. Imagine choosing 500 students at random 
from a list of 8500. There is a great temptation to seek an easy way out, and 
one method of doing so, which is commonly employed, is to take a systematic 
sample. 

The principle is that sample members are chosen in a regular manner working 
progressively through the list. Consider the student example; since 8500= 
500 x 17, we might choose a student at random in the first 17 on the list, and 
then take every 17th student subsequently. This is a systematic sample with 
sampling interval 17. Note that it is not a strictly random sample in view of 
the deterministic method of choice of sample members after the first one. 

Whether or not $N$ is a multiple of $n$, the method is readily described. Suppose
we take a sampling interval, $k$, and we have $N = (n-1)k + t$ where $0 < t \le k$.
We pick an individual at random from the first $t$ and then choose each
subsequent $k$th member on the list, to obtain a systematic sample of size $n$,
where $n$ is the least integer greater than or equal to $N/k$.
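
The rule just described translates into a short routine. This is a sketch under the stated scheme, with list positions numbered 1 to $N$ and $t$ allowed to equal $k$ (which is what happens when $N$ is a multiple of $k$); the function name and the optional `rng` argument are illustrative.

```python
import random


def systematic_sample(N, k, rng=random):
    """Positions (1-based) of a systematic sample with sampling interval k."""
    n = -(-N // k)                 # least integer >= N / k
    t = N - (n - 1) * k            # N = (n - 1)k + t with 0 < t <= k
    start = rng.randint(1, t)      # random start among the first t members
    return [start + j * k for j in range(n)]


# The student example: 8500 = 500 x 17, so every 17th student is taken.
students = systematic_sample(8500, 17, random.Random(1))
print(len(students), students[:3], students[-1])
```

Restricting the random start to the first $t$ positions guarantees that the last selected position never exceeds $N$, so the sample size is always exactly $n$.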

Such a sampling method is clearly very easy to carry out, which is a great
advantage, particularly in frequently repeated surveys, or when the sample is
chosen ‘on site’ and a list is available. Sometimes even the limited
randomisation, in the choice of the first member, is dispensed with, and the sequence is
fully prescribed at the outset. For example, it might be decided to take the
9th student, and every 17th one subsequently, on the basis that this seems to
be a systematic policy. Or consider a situation where individuals are listed in
a card index. To save even more time we might choose cards by taking cards
out every 1 cm (say) through the index, thus obtaining a ‘more or less’
systematic sample.

What are the likely effects of such a sampling approach? Administrative 
convenience is a clear benefit. We do not even need to know the population 
size N; we can merely specify a sampling fraction 1/k. Apart from the saving 
in time and effort, there is also some sort of intuitive appeal in systematic 
sampling: it seems to ‘span the population’ in a way that might lead to more 
‘representative’ results than those obtained from random choice. 

There are obvious dangers however. The procedure is not s.r. sampling since 
not all possible samples of size n have an equal chance of occurring. Related 
to this is the fact that we cannot readily obtain a variance estimate from such 
a single sample. 

If, however, there is no obvious order pattern (it is quasi-random) we can
act as if the sample is effectively an s.r. sample and thus couple the benefits
of such a sampling principle with the convenience of systematic choice.

But order can be important. Consider a medical trial in which patients have
been listed in order of treatment: three forms of treatment, A, B and C, having
been given in rotation to successive patients. If we now take a systematic
sample, with sampling interval $k$ which is a multiple of 3, we will only obtain
patients with a specific single treatment regime (A or B or C). Thus we obtain
a very unrepresentative sample. Such an extreme problem may not frequently
arise, but less deterministic order patterns are not uncommon, e.g. trends (the
listed members get successively older), groups (different parts of the list relate
to different geographic regions) or even cyclic variations (sales data for a
product in time order).
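
The medical trial example is easily imitated. In this sketch a hypothetical list of 60 patients receives treatments A, B, C in strict rotation, and systematic samples are taken with two different sampling intervals.

```python
# Hypothetical list: treatments A, B, C given in rotation to 60 patients.
patients = ["ABC"[i % 3] for i in range(60)]


def every_kth(items, k, start):
    """Systematic sample: the item at index `start`, then every kth item after it."""
    return items[start::k]


print(sorted(set(every_kth(patients, 6, 2))))   # interval a multiple of 3
print(sorted(set(every_kth(patients, 5, 2))))   # interval not a multiple of 3
```

With interval 6 every sampled patient has the same treatment, whereas interval 5 cycles through all three.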

All prospects in relation to s.r. sampling are possible. We can do better: for
example, geographical groupings may improve representativeness; we are then
effectively taking a stratified sample (see Chapter 5). We could do much worse,
as in the medical trial example above. Or we can effectively be taking just an
s.r. sample and this is the assumption commonly made in the absence of
obvious counter-indications.

We shall see (Chapter 6) that systematic sampling is an example of one-stage 
cluster sampling. 


2.8 Non-epsem sampling 


In epsem sampling, of which s.r. sampling is an example, we use a principle 
of selecting sampling units with equal probability. Sometimes, however, this is 
not straightforward nor necessarily the best approach. 

Consider the example of estimating the total wheat yield $Y_T$ for wheat farms
in Northumbria. If we could list all the farms, we could take an s.r. sample
and use the methods described above. But suppose that we marked points at
random on a map of Northumbria to identify our sample of $n$ farms, and
further assume that this had the effect of selecting the farms with probabilities
proportional to their sizes (areas) $X_i$. Termed pps sampling, this approach is
not epsem. There could also be a strong correlation between the wheat yields
$Y_i$ and the sizes $X_i$. The extreme prospect is that $Y_i = kX_i$: that is, wheat yield
is directly proportional to size.

Let us examine the effect of such an approach (and prospect), if we obtain
a sample $y_1, y_2, \ldots, y_n$ chosen with replacement (to simplify the algebra)
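and wish to estimate the total wheat yield $Y_T$. Consider the estimator

$$\hat{Y}_T = \frac{X_T}{n}\sum_{i=1}^{n}\frac{y_i}{x_i},$$

assuming that the total size (area), $X_T$, is known and that we also observe the
sample $x_1, x_2, \ldots, x_n$ for the sampled farms.
For a typical wheat yield $y$ and size $x$ chosen by pps sampling, we have

$$E\left(\frac{y}{x}\right) = \frac{1}{X_T}\sum_{i=1}^{N} X_i\left(\frac{Y_i}{X_i}\right) = \frac{Y_T}{X_T}.$$

Thus

$$E(\hat{Y}_T) = \frac{X_T}{n}\,n\,\frac{Y_T}{X_T} = Y_T,$$

so that $\hat{Y}_T$ is unbiased for $Y_T$.
Furthermore,

$$\mathrm{Var}\left(\frac{y}{x}\right) = E\left[\left(\frac{y}{x}\right)^2\right] - Y_T^2/X_T^2 = \frac{1}{X_T}\sum_{i=1}^{N}\frac{Y_i^2}{X_i} - \frac{Y_T^2}{X_T^2}.$$

Thus

$$\mathrm{Var}(\hat{Y}_T) = \frac{X_T^2}{n}\,\mathrm{Var}\left(\frac{y}{x}\right) = \frac{1}{n}\left[X_T\sum_{i=1}^{N}\frac{Y_i^2}{X_i} - Y_T^2\right].$$

In the extreme case, $Y_i = kX_i$ and $Y_T = kX_T$. Hence

$$\mathrm{Var}(\hat{Y}_T) = k^2X_T^2/n - k^2X_T^2/n = 0$$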

and we have discovered the perfect estimator, $\hat{Y}_T$, of $Y_T$ with zero variance!

Of course this is an artificial example of no practical value! If $Y_T = kX_T$,
and we know $X_T$, then we also know $Y_T$ without the need for estimation. But
an important principle is embedded in this example. If the $Y_i$ and $X_i$ are
correlated, albeit imperfectly and short of proportionality, pps sampling may
still be efficient and improve on s.r. sampling. Also knowledge of $X_T$ may still
be critically important. We shall return to these themes in Chapter 4, where
we examine for s.r. sampling the value of simultaneously observing the principal
measure $Y$ and a correlated measure $X$, and in Chapters 4 and 6, where we
further consider non-epsem (pps-type) sampling, including the prospect of
sampling with replacement.
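
A small numerical sketch may help fix the principle. It assumes that the $n$ farms are drawn independently (that is, with replacement) with probabilities $X_i/X_T$, and that the estimator is $(X_T/n)\sum y_i/x_i$, whose variance is then $(1/n)[X_T\sum Y_i^2/X_i - Y_T^2]$. The four farm sizes and yields are hypothetical. For comparison it also evaluates $N^2(1-f)S^2/n$, the variance of $N\bar{y}$ under s.r. sampling, which follows from (2.3).

```python
from fractions import Fraction


def pps_total_variance(Y, X, n):
    """Variance of (X_T / n) * sum(y_i / x_i) for n independent pps draws."""
    XT, YT = sum(X), sum(Y)
    return (XT * sum(Fraction(y * y, x) for y, x in zip(Y, X)) - YT * YT) / n


def srs_total_variance(Y, n):
    """Variance of N * ybar under simple random sampling, from (2.3)."""
    N = len(Y)
    Ybar = Fraction(sum(Y), N)
    S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)
    return N * N * (1 - Fraction(n, N)) * S2 / n


X = [10, 20, 30, 40]                  # hypothetical farm sizes
Y_exact = [3 * x for x in X]          # yield exactly proportional to size
Y_near = [32, 58, 92, 118]            # yield roughly proportional to size

print(pps_total_variance(Y_exact, X, n=2))
print(float(pps_total_variance(Y_near, X, n=2)), float(srs_total_variance(Y_near, n=2)))
```

Exact proportionality gives zero variance, and even rough proportionality leaves the pps variance far below the s.r. sampling figure in this illustration.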

When we move away from s.r. sampling, there are a number of complications 
that arise in seeking to employ an arbitrary sampling scheme $\{s_i, \pi_i\}$. The first 
is how to generate the samples $s_i$ with their corresponding probabilities $\pi_i$. 
The second is how to derive the statistical properties of estimators based on 
the scheme $\{s_i, \pi_i\}$. It is sometimes possible to alleviate both of these areas of 
difficulty by using a scheme defined in terms of sequential generation of 
successive sample members, especially one in which population members are 
chosen with replacement. A general scheme of this type arises when, at each 
stage, the different values $Y_1, Y_2, \ldots, Y_N$ can occur independently with 
respective probabilities $p_1, p_2, \ldots, p_N$. Suppose we so choose a sample of size 
n, the sample values being $y_1, y_2, \ldots, y_n$. 


Again we might estimate $\bar{Y}$ by the simple average 

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

The properties of $\bar{y}$ are easily determined since $y_1, y_2, \ldots, y_n$ is a random 
sample from a known probability distribution. We have 

$$E(\bar{y}) = \sum_{i=1}^{N} p_iY_i = \mu$$

and 

$$\operatorname{Var}(\bar{y}) = \frac{1}{n}\left[\sum_{i=1}^{N} p_iY_i^2 - \mu^2\right].$$

But $\mu$ will not equal $\bar{Y}$ except in special circumstances, so that $\bar{y}$ is in general 
biased and can have large expected mean square error. 

The situation is improved if we use a different estimator in which the 
observations are divided by N times their respective probabilities of 
occurrence, so that we use 

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n}\frac{y_i}{Nq_i},$$

where $q_i$ is that member of the set $\{p_1, p_2, \ldots, p_N\}$ corresponding to the 
population value chosen as the ith member of the sample. Equivalently, we 
are now using the sample mean of a random sample drawn with replacement 
from a distribution in which values $Y_i' = Y_i/(Np_i)$ arise with probabilities 
$p_i$ $(i = 1, 2, \ldots, N)$. The mean and variance of this distribution are $\bar{Y}$ and 

$$\frac{1}{N^2}\sum_{i=1}^{N}\frac{Y_i^2}{p_i} - \bar{Y}^2,$$

respectively. Thus $\bar{y}$ is now unbiased for $\bar{Y}$, with variance 

$$\frac{1}{nN^2}\left[\sum_{i=1}^{N}\frac{Y_i^2}{p_i} - N^2\bar{Y}^2\right]. \qquad (2.11)$$

This variance can again in principle be arbitrarily small; if we put $p_i = Y_i/(N\bar{Y})$, 
then each sampled value is precisely $\bar{Y}$ and $\operatorname{Var}(\bar{y}) = 0$! 

Of course this is again of no immediate practical use, since it implies a 
knowledge of the precise value of $\bar{Y}$, in which case we would not need to 
sample the population. But this pathological result does motivate a particular style 
of sampling which can often be used with advantage. It can be really useful 
on occasions to adopt a non-epsem scheme with prescribed selection 
probabilities $p_i$ and with-replacement sampling. 

The optimum aim is to sample the population with the probabilities of 
selection of different members proportional to their values, $Y_i$. Without any 
auxiliary information about the population being sampled, this is unrealistic. 
But with some knowledge of the population structure, or the facility for 
sampling an auxiliary variable, X, correlated with Y, progress in this direction 
is possible. As examples, we are led to consider in this respect sampling with 
replacement for ratio estimation, or in cluster sampling. Some further details 
are given in Chapters 4 and 6 below. 
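
As an illustration of (2.11), the sketch below computes the exact mean and variance of the weighted estimator for a small population. The population values and the two probability vectors are hypothetical, and the function name is chosen only for this example. Exact rational arithmetic is used so that the zero-variance case appears exactly.

```python
from fractions import Fraction

def weighted_mean_moments(Y, p, n):
    """Exact mean and variance of (1/n) * sum(y_i / (N q_i)) for n independent
    draws in which population value Y_i is selected with probability p_i."""
    N = len(Y)
    scaled = [Fraction(y) / (N * pi) for y, pi in zip(Y, p)]  # values Y_i / (N p_i)
    mean = sum(pi * v for pi, v in zip(p, scaled))
    var_one_draw = sum(pi * v * v for pi, v in zip(p, scaled)) - mean ** 2
    return mean, var_one_draw / n

Y = [2, 4, 6, 8]                                   # hypothetical population, mean 5
p_equal = [Fraction(1, 4)] * 4                     # equal selection probabilities
p_prop = [Fraction(y, sum(Y)) for y in Y]          # p_i = Y_i / (N * Ybar)

mean_equal, var_equal = weighted_mean_moments(Y, p_equal, n=2)
mean_prop, var_prop = weighted_mean_moments(Y, p_prop, n=2)
```

Both choices give an unbiased estimator of the population mean; only the probabilities proportional to the $Y_i$ drive the variance to zero.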


2.9 Estimating the population total, $Y_T$ 


There are many situations in which we are interested in estimating the 
population total 

$$Y_T = N\bar{Y}$$

rather than the population mean $\bar{Y}$. For example, in a survey of annual yields 
of wheat for farms in Northumbria, the concern may be to estimate the county’s 
total annual wheat yield. In view of the simple relationship between $Y_T$ and 
$\bar{Y}$, no substantial extra difficulties arise; we can immediately extend the results 
we have obtained concerning the estimation of $\bar{Y}$. 

The s.r. sample estimator of $Y_T$ which is commonly used is 

$$y_T = N\bar{y},$$

and the earlier results confirm that $y_T$ is unbiased for $Y_T$; that is, 

$$E(y_T) = Y_T,$$

with 

$$\operatorname{Var}(y_T) = (1-f)N^2S^2/n. \qquad (2.12)$$

Furthermore, $y_T$ is the minimum variance linear unbiased estimator of $Y_T$ based 
on a simple random sample of size n. 

With similar qualifications concerning the sample size, n, and value of the 
sampling fraction, f, we can use the normal approximation 

$$y_T \sim N[Y_T, (1-f)N^2S^2/n]$$

to construct confidence intervals for $Y_T$, or to choose a sample size to meet 
specified requirements concerning the precision of estimation of $Y_T$. 
For example, if n is more than about 40, an approximate $100(1-\alpha)\%$ 
symmetric two-sided confidence interval for $Y_T$ is given by 

$$y_T - z_\alpha Ns\sqrt{(1-f)/n} < Y_T < y_T + z_\alpha Ns\sqrt{(1-f)/n}.$$

For smaller n, the use of percentage points $t_{n-1}(\alpha)$ for the t-distribution with 
(n − 1) degrees of freedom, in place of $z_\alpha$, is preferable. 
Let us consider the question of choosing n to ensure that 

$$\Pr(|y_T - Y_T| > d) \le \alpha.$$

Using the normal approximation, this requires 

$$n \ge N\left[1 + \frac{1}{N}\left(\frac{d}{z_\alpha S}\right)^2\right]^{-1}. \qquad (2.13)$$

Equivalently, we require 

$$\operatorname{Var}(y_T) \le (d/z_\alpha)^2 = V,$$

so that (2.13) becomes 

$$n \ge \frac{N^2S^2}{V}\left(1 + \frac{1}{N}\cdot\frac{N^2S^2}{V}\right)^{-1}.$$

So if $NS^2/V$ is very much less than 1, it will be reasonable to take 

$$n_0 = \frac{N^2S^2}{V}$$

as the required sample size; otherwise we must use 

$$n_0(1 + n_0/N)^{-1}.$$


In some respects it is more natural to express the accuracy we require of an 
estimator of $Y_T$ (or even of $\bar{Y}$) in proportional, rather than absolute, terms. 
Thus we might ask what sample size is needed to ensure that 

$$\Pr(|y_T - Y_T| > \epsilon Y_T) \le \alpha. \qquad (2.14)$$

For example, we may want to be at least 95% certain that $y_T$ is within 2% of 
$Y_T$. Then $\alpha = 0.05$, $\epsilon = 0.02$. 

But this is less straightforward. It becomes necessary to replace d in (2.13) 
by $\epsilon Y_T$, and the appropriate value of n now depends on $Y_T$, the unknown 
quantity we are trying to estimate. The best we can hope for is to obtain a 
rough estimate of the sample size necessary to satisfy (2.14), by replacing $Y_T$ 
in the right hand side of the inequality by some provisional estimate (possibly 
based on earlier surveys or experience). 

For example, if the y-values have an approximate Poisson distribution (so 
that $S^2 = \bar{Y} = Y_T/N$) then (2.13) takes the form 

$$n \ge N\left[1 + \left(\frac{\epsilon}{z_\alpha}\right)^2 Y_T\right]^{-1}.$$


Example 2.4 


To obtain an early indication of the total sales of Christmas cards 
throughout a network of 243 retail stationery shops, it is decided 
that a random sample of the shops should submit returns of their 
card sales by the end of January. How large a sample is needed to 
estimate total sales to within 10% of the correct figure with 95% 
assurance? 

By July of each year precise figures of total sales of cards are 
available. For the previous three years the number of shops in the 
network has remained much the same; the total card sales, and 
standard deviations of sales from shop to shop, have been (in units 
of 10000 cards) 

    Y_T      S 
    321.7    0.826 
    366.8    0.776 
    401.0    0.804 

So for the current year we might reasonably expect that $Y_T$ and S 
will be of the order of (say) 420 and 0.8, respectively. To obtain 
the required precision from the January returns it will consequently 
be necessary to take a simple random sample of size n, where, from 
(2.13), 

$$n \ge 243\left[1 + \frac{1}{243}\left(\frac{42}{0.8 \times 1.96}\right)^2\right]^{-1} = 243(1 + 2.95)^{-1} = 61.48.$$

So a sample of size n = 62 is needed. 
Here $n_0 = N^2S^2/V = 82.30$, so that $n_0/N = 0.34$ and we do in fact 
need to use the more accurate expression $n_0(1 + n_0/N)^{-1}$ to obtain 
the above value of 62 for the required sample size. 

Suppose that such a sample of size 62 yields an estimate 

$$y_T = 427.4,$$

then we obtain an approximate 95% confidence interval for $Y_T$ as 

$$385.6 < Y_T < 469.2,$$

reflecting (roughly) the 10% proportional accuracy we sought. 
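
The arithmetic of Example 2.4 can be reproduced with a short sketch. The function names are illustrative only, and the interval calculation assumes that the sample standard deviation turned out to be 0.8, a value the example does not state explicitly but which reproduces the quoted limits.

```python
import math

def sample_size_for_total(N, S, d, z=1.96):
    """First approximation n0 = N^2 S^2 / V and the f.p.c.-adjusted size
    n0 / (1 + n0 / N), where V = (d / z)^2, as in (2.13)."""
    V = (d / z) ** 2
    n0 = N ** 2 * S ** 2 / V
    n = n0 / (1 + n0 / N)
    return n0, n, math.ceil(n)

def total_interval(y_T, N, n, s, z=1.96):
    """Normal-approximation interval y_T -/+ z N s sqrt((1 - f) / n)."""
    half_width = z * N * s * math.sqrt((1 - n / N) / n)
    return y_T - half_width, y_T + half_width

# 243 shops, S about 0.8, total about 420, 10% accuracy with 95% assurance.
n0, n_exact, n_needed = sample_size_for_total(N=243, S=0.8, d=0.10 * 420)

# s = 0.8 is an assumption made here; the example does not quote it.
low, high = total_interval(427.4, N=243, n=62, s=0.8)
```

Rounding the exact requirement up to the next integer gives the sample of 62 shops used in the example.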


When attempting to estimate $Y_T$ with prescribed precision, we again typically 
encounter difficulties because we are unlikely to know the value of $S^2$, so that 
some preliminary indication of its value will be needed using one of the 
methods (i), (ii), (iii), or (iv) described in Section 2.6. Example 2.4 illustrates 
method (ii), in a case where the previous information arises from a complete 
enumeration, rather than from a sample survey, of a similar situation. 

Finally we should note that the population size N needs to be known if we 
are to use $y_T$ to estimate $Y_T$ or to assess the sampling behaviour of $y_T$. This 
is true also for the study of $\bar{y}$ as an estimator of $\bar{Y}$. In most situations N will 
be known, or can be estimated with fair accuracy. Where this is not so, 
difficulties arise even in the very choice of an s.r. sample. 


2.10 Estimating a proportion, P 


Consider an engineering process in which a special component is produced 
for use in the assembly of a car. If some dimension, Y, is not within required 
tolerances, the component cannot be used. To estimate the proportion, 
P, of useful components in a large batch (that is, those within the required 
tolerances), an s.r. sample of size n is taken and a count is made of the number, 
r, which have satisfactory values of Y. The population of Y values is not in 
itself of interest; we would like to know merely the proportion P of such 
values which lie within the tolerance limits. 

Rather than studying a quantitative measure, Y, in relation to whether it 
satisfies some criterion, we may sometimes be concerned directly with some 
qualitative attribute or characteristic: for example, with the proportion P of 
the inhabitants of Renton who live in rented accommodation. Again, an s.r. 
sample of size n gives an indication of P. If r out of the n live in rented 
accommodation we might estimate P by the s.r. sample proportion, 


p=r/n. 


As in the case of the estimation of the population total, we can again readily 
modify the results for estimation of the population mean, $\bar{Y}$, to describe the 
properties of the estimator p. Suppose P represents the proportion of members 
of a finite population of size N who possess some characteristic A. We define 
a variable $X_i$ describing the ith member of the population so that 

$$X_i = \begin{cases} 1 & \text{if the member possesses characteristic } A, \\ 0 & \text{otherwise.} \end{cases}$$

Then 

$$R = \sum_{i=1}^{N} X_i$$

is the number of members possessing the characteristic A. Consequently, 

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i = R/N = P,$$

so that the proportion P is merely the population mean for the X-values. 
Likewise, the sample proportion p is just the mean $\bar{x}$ for the sample of X 
values. In discussing the performance of p as an estimator of P we are once 
again considering the use of an s.r. sample mean to estimate the corresponding 
population mean. The only essential difference arises from the simple structure 
of the population of X-values, where only the values 0 or 1 can occur. This 
implies a relationship between the population mean $\bar{X}$ (or P) and the 
population variance $S^2$, which takes the form 

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2 = \frac{NP(1-P)}{N-1},$$

with a corresponding effect for the sampling behaviour of the estimator p. We 
will put Q = 1 − P. Then from the results in Section 2.2 we have 

$$E(p) = P,$$

so that p is unbiased for P, and 

$$\operatorname{Var}(p) = (1-f)S^2/n = \frac{(N-n)}{(N-1)}\,PQ/n. \qquad (2.16)$$


But if P is unknown, $S^2$ will not be known. We can estimate $S^2$ by the 
unbiased estimator 

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = npq/(n-1),$$

where q = 1 − p. Thus an unbiased estimator of Var(p) is given by 

$$s^2(p) = (1-f)pq/(n-1).$$

Note that this is not the sample analogue $(N-n)pq/n(N-1)$ of (2.16) as we 
might intuitively think, although in practice the difference is unlikely to be 
important. 
If the sampling fraction f is negligible, the estimator of Var(p) takes the 
simple form 

$$s^2(p) = pq/(n-1).$$

This holds, in particular, when we are sampling from an infinite population. 


2.11 Confidence intervals for P 


In sampling attributes or characteristics to estimate a proportion P, we know 
more about the sampling distribution of our estimator, p, than in the 
corresponding situations of estimating $\bar{Y}$ or $Y_T$. Indeed, the exact distribution of 
p is known! The number, r, of the sample members possessing the required 
attribute has a hypergeometric distribution, 

$$p(r) = \binom{R}{r}\binom{N-R}{n-r}\bigg/\binom{N}{n}, \qquad \max(0, n-N+R) \le r \le \min(R, n).$$


We could thus make exact probability statements about r as a basis for 
constructing confidence intervals for P. In principle, calculations could be 
eased by using published tables or charts of cumulative probabilities for the 
hypergeometric distribution (particularly those especially designed for the 
purpose of making confidence statements about P). For example, Chung and 
DeLury (1950) give charts of 90%, 95%, and 99% confidence limits for P for 
N =500, 2500, and 10 000 for various values of n and p. Tables of cumulative 
probabilities for the hypergeometric distribution, for more modest values of 
N (up to 20), are given in Owen (1962). But such sources provide only a 
limited aid; it is not possible to obtain very accurate results from the charts, 
whilst direct tabulations cover a relatively small range of values and can require 
complicated inverse interpolation. Alternatively, we could calculate required 
probabilities and conduct the inverse interpolations by computer, but again 
this is a major task. Thus the knowledge of the exact distribution of r turns 
out to be rather unimportant in practice in view of the tedious calculations 
involved in the use of the hypergeometric distribution. 
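
To indicate what such a computer calculation involves, here is a minimal sketch that inverts the hypergeometric distribution by direct search over R. It assumes the common equal-tails construction, in which a value of R is retained when neither tail probability at the observed r falls to α/2 or below; this is one possible convention and is not claimed to be the method behind the published charts. The function names are illustrative.

```python
from math import comb

def hypergeom_pmf(r, N, R, n):
    """Pr(r marked members in an s.r. sample of n from N, of which R are marked)."""
    if r < max(0, n - N + R) or r > min(R, n):
        return 0.0
    return comb(R, r) * comb(N - R, n - r) / comb(N, n)

def exact_limits_for_P(r, N, n, alpha=0.05):
    """Smallest and largest R / N not rejected by either one-sided test of
    size alpha / 2, found by trying every R from 0 to N."""
    accepted = []
    for R in range(N + 1):
        upper_tail = sum(hypergeom_pmf(k, N, R, n) for k in range(r, n + 1))
        lower_tail = sum(hypergeom_pmf(k, N, R, n) for k in range(0, r + 1))
        if upper_tail > alpha / 2 and lower_tail > alpha / 2:
            accepted.append(R)
    return accepted[0] / N, accepted[-1] / N

# Small hypothetical case: none of 5 sampled members (from N = 20) has the attribute.
limits = exact_limits_for_P(r=0, N=20, n=5)
```

The search costs about N × n probability evaluations, which is trivial on modern hardware even for populations of many thousands.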

Consequently we must again seek useful approximations to the sampling 
distribution of the estimator. One obvious possibility is to use the binomial 
distribution as an approximation to the hypergeometric distribution. If n is 
small relative to both R and (N − R), the lack of replacement of sampled 
members of the population can be ignored and r has essentially a binomial 
distribution, B(n, P). We could use this binomial distribution to construct 
confidence intervals for P. But again we encounter computational difficulties. 
The binomial distribution also involves tedious calculation (with inverse inter- 
polation). Only if n is quite small, so that the calculations are reasonable, or 
if published charts or tables of confidence limits (e.g. those given in Neave 
1978, or Fisher and Yates, 1973) are immediately relevant, is it worth proceed- 
ing with the binomial approximation. 

In most applications we will find it convenient to go one stage further and 
use the normal approximation to the binomial distribution. Thus we will 
effectively assume that 


$$p \sim N(P, (1-f)PQ/n) \qquad (2.17)$$


as a basis for constructing approximate confidence intervals for P. Notice how 
this is not the immediate extension of the argument supporting the binomial 
distribution for p, since some account is taken of the ‘lack of replacement’ in 
incorporating the f.p.c. in Var(p). In comparison with (2.16), we have also 
omitted a factor N/(N − 1) from Var(p), which is justified for the sizes of 
population where the normal approximation will be used. 

The normal approximation (2.17) will be reasonable, provided: 

(i) n is not too large relative to R or (N − R), 

(ii) the smaller of nP and nQ is not too small; for example min(nP, nQ) > 30 
should suffice. If P is in the region of ½, much smaller values of nP will 
be acceptable. Of course the values of nP and nQ will not be known 
and will need to be assessed through their unbiased estimators np and 
nq. 

We can determine confidence intervals for P from (2.17) in the usual manner 
of applying the normal approximation to the binomial distribution. Thus, from 
(2.17), we have 

$$\Pr\left\{\frac{|p - P|}{\sqrt{(1-f)PQ/n}} < z_\alpha\right\} \approx 1 - \alpha, \qquad (2.18)$$

so that an approximate $100(1-\alpha)\%$ two-sided confidence interval for P is given 
as the region between the two roots of the quadratic equation 

$$P^2[1 + z_\alpha^2(1-f)/n] - P[2p + z_\alpha^2(1-f)/n] + p^2 = 0.$$

This can be simplified even further if n is sufficiently large. Replacing Var(p) 
in (2.17) by its unbiased estimator $s^2(p)$ we have an approximate $100(1-\alpha)\%$ 
two-sided confidence interval for P 

$$p \pm z_\alpha\sqrt{(1-f)pq/(n-1)}. \qquad (2.19)$$


We could also introduce the usual continuity correction to take account of 
the fact that we are approximating a discrete distribution by a continuous one. 
This is particularly relevant when n is near the limit for justification of the 
normal approximation; it will tend to correct the length of the interval which 
would otherwise be too short. 

The circumstances under which we may proceed to these various stages of 
approximation are not easily described in a concise manner. Some indication 
has been given in the discussion above. The principal determinants are the 
values of n and of NP and NQ relative to N. Some further details, with 
numerical illustration, are given by Cochran (1977, Chapter 3). 

2.12 Choice of sample size in estimating a proportion 


Consider the effect of the form (2.16) for Var(p). Clearly this will be a maximum 
when P = Q = ½, so that for a given sample size n we will be able to estimate 
P least accurately when it is in the region of ½. This effect is more fully assessed 
by considering the value of $\sqrt{PQ}$ (reflecting the standard error of p). For 
¼ < P < ¾, $\sqrt{PQ}$ only varies over the range (0.433, 0.500), and little change 
occurs in the accuracy of the estimator p. P needs to be in the region of 0.07 
(or 0.93) before the standard error is reduced to 50% of its maximum value. 


Example 2.5 


Suppose we wish to estimate P with an s.r. sample large enough 
for the standard error (S.E.) of the estimator p to be no more than 
2%. How large a sample would be needed? This will depend on 
what we mean by ‘no more than 2%’. Is this a statement about the 
absolute value of the standard error, or do we require the standard 
error to be no more than 2% of P, i.e. are we concerned with the 
relative value of the standard error: with the ratio of the standard 
error to the mean, that is, the coefficient of variation? The results 
will be quite different in the two cases! 
(i) For an absolute value, we want 

$$\text{S.E.}(p) = \sqrt{PQ/n} \le 0.02$$

(assuming that the population is large so that the f.p.c. can be 
ignored). 
(ii) For a relative value, we want 

$$\text{S.E.}(p)/P = \sqrt{Q/nP} \le 0.02.$$


When P is large the required sample sizes are similar in the two 
cases (for P =0.98 we have n= 49, n = 52, respectively, in cases (i) 
and (ii)). For small values of P the required sample sizes are 
markedly different (49 and 122 500, respectively in cases (i) and 
(ii), when P=0.02). 
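
These sample sizes follow directly from the two inequalities, as the sketch below shows. It ignores the f.p.c., as the example does, and uses exact rational arithmetic so that a boundary case such as n = 49 is not pushed up to 50 by floating-point rounding. The function names are illustrative.

```python
from fractions import Fraction
from math import ceil

def n_for_absolute_se(P, se):
    """Smallest n with sqrt(PQ / n) <= se, ignoring the f.p.c."""
    P, se = Fraction(P), Fraction(se)
    return ceil(P * (1 - P) / se ** 2)

def n_for_relative_se(P, cv):
    """Smallest n with sqrt(Q / (nP)) <= cv, ignoring the f.p.c."""
    P, cv = Fraction(P), Fraction(cv)
    return ceil((1 - P) / (P * cv ** 2))

# Decimal strings keep the inputs exact when converted to Fraction.
sizes = {P: (n_for_absolute_se(P, "0.02"), n_for_relative_se(P, "0.02"))
         for P in ("0.98", "0.5", "0.02")}
```

The dictionary `sizes` pairs each value of P with the (absolute, relative) requirement, reproducing the contrast between 49 and 52 at P = 0.98 and between 49 and 122 500 at P = 0.02.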

Clearly, when we prescribe a relative value for the standard error 
of p, the required sample size inevitably increases consistently with 
decreasing values of P. It becomes unreasonably large for most 
practical purposes, in spite of the fact that the accuracy requirement 
(a 2% coefficient of variation) is not particularly stringent. And yet 
this is likely to be the type of requirement we would impose in 
situations where interest centres on estimating by Np the total 
number of individuals, R = NP, in the population who possess the 
defining characteristic. To keep the sample size manageable when 
P is small, we may well have to be content with a coefficient of 
variation somewhat larger than the 2% considered above. The 
marked differences in the effects on required sample size of absolute 
and relative specifications of accuracy are shown (for the 2% 
specification of accuracy) in Figure 2.2. 

Fig. 2.2. Required sample sizes for ‘2% accuracy’ in Example 2.5 

The choice of a sample size to ensure certain limits on the standard error or 
coefficient of variation is of course equivalent to our earlier interest in achieving, 
with prescribed probability, a specified absolute, or proportional, accuracy for 
the estimator itself. 

Thus to choose n to ensure that 

$$\Pr(|p - P| > d) \le \alpha, \qquad (2.20)$$

or 

$$\Pr(|p - P| > \epsilon P) \le \alpha, \qquad (2.21)$$

requires (using the normal approximation and ignoring the f.p.c.) choice of n 
so that 

(i) $\text{S.E.}(p) = \sqrt{PQ/n} \le d/z_\alpha$, 
or 

(ii) $\text{S.E.}(p)/P = \sqrt{Q/nP} \le \epsilon/z_\alpha$, 


respectively. 

In practical situations we must again recognise that the standard error, or 
the coefficient of variation, of p will not be known precisely since they depend 
on P, the quantity we are estimating. However, one facility we now have which 
was not present when estimating $\bar{Y}$ or $Y_T$ is that we can place an upper bound 
on the sample size required to achieve a required absolute accuracy in the 
estimation of P, whatever the value that P happens to have. 

To satisfy (2.20) we need 

$$n \ge PQz_\alpha^2/d^2.$$

But PQ has a maximum value of ¼, when P = ½. So taking $n = z_\alpha^2/4d^2$ will 
certainly satisfy (2.20). Furthermore, this will not be too extravagant a policy 
over a quite wide range of values of P, say (0.30 < P < 0.70). No similar facility 
is available if we want to ensure a certain proportional accuracy; see Figure 
2.2 above. 

Several loose ends remain to be tied up. It may be that the f.p.c. cannot be 
ignored (since the sampling fraction may need to be sizeable to ensure the 
required accuracy). Then again, we may be investigating a rather rare (or 
rather common) attribute in the population, so that the implication of assuming 
that P = ½ in determining n to satisfy (2.20) will be to grossly oversample the 
population. Finally we may want to ensure a prescribed proportional accuracy. 

Consider first of all the effect of retaining the f.p.c. and using the exact 
form (2.16) for Var(p). To satisfy (2.20) we need (assuming the normal 
approximation to be justified) 

$$n \ge N\left[1 + \frac{(N-1)}{PQ}\left(\frac{d}{z_\alpha}\right)^2\right]^{-1}$$

or, putting $(d/z_\alpha)^2 = V$, 

$$n \ge \frac{PQ}{V}\left[1 + \frac{1}{N}\left(\frac{PQ}{V} - 1\right)\right]^{-1}. \qquad (2.22)$$

So as a first approximation to the required sample size, we have 

$$n_0 = \frac{PQ}{V},$$

which is just what was obtained above when we ignored the f.p.c. 

If $n_0/N$ is not negligible, then we must use the more exact expression (2.22) 
to obtain 

$$n = n_0\{1 + (n_0 - 1)/N\}^{-1}.$$

Similarly, from (2.21), for a required relative accuracy, we need 

$$n \ge N\left[1 + (N-1)\frac{P}{Q}\left(\frac{\epsilon}{z_\alpha}\right)^2\right]^{-1}, \qquad (2.23)$$

so that, as a first approximation, 

$$n_0 = \frac{Q}{P}\left(\frac{z_\alpha}{\epsilon}\right)^2$$

(as obtained above for the required sample size when ignoring the f.p.c.), and, 
more accurately, 

$$n = n_0\{1 + (n_0 - 1)/N\}^{-1}.$$
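
A short sketch of these calculations follows. It treats P as known, which in practice means inserting an advance estimate, and the numerical inputs (P = 0.5, N = 5000, d = 0.05) are hypothetical values chosen only to show the effect of the f.p.c. The function name and its arguments are illustrative.

```python
from math import ceil

def n_for_proportion(P, N, z=1.96, d=None, eps=None):
    """Required s.r. sample size for estimating P with the f.p.c. retained.

    Give d for absolute accuracy (2.20) or eps for relative accuracy (2.21).
    Returns the first approximation n0, the adjusted value
    n0 / (1 + (n0 - 1) / N), and that value rounded up."""
    if (d is None) == (eps is None):
        raise ValueError("give exactly one of d (absolute) or eps (relative)")
    Q = 1 - P
    V = (d / z) ** 2 if d is not None else (eps * P / z) ** 2
    n0 = P * Q / V
    n = n0 / (1 + (n0 - 1) / N)
    return n0, n, ceil(n)

# Hypothetical inputs: worst case P = 0.5, population of 5000, d = 0.05.
n0, n_exact, n_needed = n_for_proportion(P=0.5, N=5000, d=0.05)
```

Here the first approximation of about 384 falls to 357 once the f.p.c. is taken into account, a reduction of roughly 7%.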


In conclusion we must take account of the fact that P will not be known 
precisely (otherwise the survey would be pointless), so that (2.22) and (2.23) 
are not directly applicable. Again the methods (i), (ii), (iii), and (iv) of Section 
2.6 must be considered as means of providing some ‘advance estimate’ of P 
for the purpose of determining the required sample size. 

A pilot study will yield a preliminary estimate of P—this essentially combines 
methods (i) and (iv). The upper bound, provided by assuming that P = ½ when 
absolute accuracy is of interest, also illustrates the use of method (iv). Again, 
and this is by no means uncommon, allied surveys (at an earlier time or on a 
related topic) may provide a reasonable idea of the value of P, as may published 
statistics on a wider front. 

For example, suppose a large nationally based survey was conducted two 
years ago to investigate family wealth. It showed that 9% of families possessed 
more than 1 car. If we decide to estimate the same measure on some particular 
local community we might choose to act as if P is somewhere in the region 
of 0.09 in order to determine a required sample size. Needless to say, we must 
ensure that the concept of the ‘family’ is a similar one in the two cases, and 
take note of any particular social structure in the local survey which may 
distinguish it from the national one (the two-year time separation is also 
relevant, of course). 

Or again we might use two-stage sampling with the first-stage estimate of P 
employed to guide the choice of sample size for the second-stage (augmenting) 
sample. Specifically, if the first-stage sample size is $n_1$ with estimate $p_1$, we 
should take a further $n - n_1$ observations to satisfy 

$$n = \frac{p_1(1-p_1)}{V} + \frac{3 - 8p_1(1-p_1)}{p_1(1-p_1)} + \frac{1 - 3p_1(1-p_1)}{n_1V}$$

to achieve a desired variance V, estimating P by $p + V(1-2p)/[p(1-p)]$ where 
p is the usual full-sample estimate. Alternatively, we would take 

$$n = \frac{1-p_1}{Cp_1} + \frac{3}{p_1(1-p_1)} + \frac{1}{Cn_1p_1}$$

to achieve a desired coefficient of variation C in which case we estimate P by 
$\hat{P} = p - Cp(1-p)$ where p is the full-sample estimate. (See Cochran, 1977, 
Chapter 4, for more detail.) 


Example 2.6 


Suppose, in the context of Example 2.3, that a somewhat liberal 
view is taken of ‘casual holidays’, recognising that the work is of 
such a nature that workers will feel the need to ‘take the odd day 
off on the spur of the moment’. Up to 3 days in six months is 
regarded as reasonable; we want to estimate the proportion of 
workers taking more than 3 days off in the six months of study. 
From the figures given in Example 2.3 we find 


p = 88/1000 = 0.088, 
and obtain a 95% confidence interval for P, from (2.19), as 
0.071< P<0.105 


(Introducing the continuity correction yields 0.070 < P <0.106). 
The more accurate form obtained from solving the quadratic 
equation is 


0.072 < P<0.107. 


These results show some minor discrepancies. But these are 
hardly of any practical importance. Even ignoring the f.p.c. in (2.19) 
gives a very similar interval: 0.0704< P<0.106. 
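
The intervals quoted in Example 2.6 can be reproduced with the sketch below. The population size N = 36 000 is an assumption taken from the workforce figure quoted in Exercise 2.2, since the details of Example 2.3 are not repeated here; the function names are illustrative.

```python
from math import sqrt

def simple_interval(p, n, N=None, z=1.96):
    """Interval (2.19): p -/+ z sqrt((1 - f) p q / (n - 1)); f = 0 if N is None."""
    f = 0.0 if N is None else n / N
    half_width = z * sqrt((1 - f) * p * (1 - p) / (n - 1))
    return p - half_width, p + half_width

def quadratic_interval(p, n, N=None, z=1.96):
    """Roots of P^2 [1 + c] - P [2p + c] + p^2 = 0, with c = z^2 (1 - f) / n."""
    f = 0.0 if N is None else n / N
    c = z * z * (1 - f) / n
    a, b = 1 + c, 2 * p + c
    root = sqrt(b * b - 4 * a * p * p)
    return (b - root) / (2 * a), (b + root) / (2 * a)

p, n, N = 88 / 1000, 1000, 36000   # N = 36 000 is assumed, see the note above
with_fpc = simple_interval(p, n, N)
from_quadratic = quadratic_interval(p, n, N)
without_fpc = simple_interval(p, n)
```

The three results differ only in the third decimal place, which is the point the example makes.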


2.13 Further comments on estimating proportions 


Sub-populations 


In the industrial example on ‘casual holidays’ (Examples 2.3 and 2.6), it is 
likely that quite different patterns of behaviour exist for different groups of 
workers. Different average numbers, and total numbers, of days ‘casual holiday’ 
may have been taken by the workers in the sub-populations for these different 
groups of workers. The population we have examined in Examples 2.3 and 
2.6 is the combination of all the sub-populations; our estimates or inferences 
relate to this aggregate. But we may well wish to study the characteristics of 
the sub-populations separately. What group of workers takes the most, or least, 
number of days ‘casual holiday’? What is the behaviour pattern for some 
particular group? 
This principle extends to estimating means, or totals. 
Thus if there are k sub-populations with means $\bar{Y}_i$, totals $Y_{iT}$, variances $S_i^2$, 
or proportions $P_i$ (i = 1, ..., k), we may wish to estimate these characteristics 
separately. If we take advance note of this interest and can take simple random 
samples from each sub-population separately, the methods described above 
and their corresponding properties will apply to each sub-population. 

But suppose we have taken a simple random sample from the aggregate 
population. How are we now to study the sub-populations? Subsequent assign- 
ment of observations to the sub-populations does not yield s.r. samples in these 
sub-populations—consequently the earlier results do not apply. The problem 
is that the sample sizes are not predetermined, but are themselves random 
quantities subject only to a constraint on their sum. More detailed study is 
now needed of how to analyse the data appropriately, in order to estimate the 
$\bar{Y}_i$, $Y_{iT}$, $S_i^2$, or $P_i$. Some of the results of such study are described by Cochran 
(1977, Chapter 2). 

Such structured populations present a complementary possibility. It may be 
that in recognising the different natures of the sub-populations, and sampling 
these sub-populations separately, we might be able to increase the efficiency 
of our study of the aggregate population. This is indeed true under certain 
circumstances, as we shall see later (Chapter 5) when we consider the ideas 
of stratification, and of stratified sampling. 


Multiple categories and multiple attributes 


We have considered at some length the estimation of the proportion P of a 
population falling into some category. The classification of members of the 
population was a dichotomous one—each member was either in the category 
of interest, or not. Often a more complex classification exists, and population 
members must be assigned to one of several categories. For example, in an 
enquiry into social structure we may want to estimate the proportions of people 
in different social classes, where of course there are more than just two 
possibilities. Or again in the industrial example above, we might wish to 
estimate the proportions of workers in the different occupational groups. 

Thus instead of a single proportion P (and the complementary proportion 
Q = 1 − P not in the category of interest) we may have proportions 
$P_1, P_2, \ldots, P_k$ in k distinct categories (k > 2) that must be estimated from the 
appropriate assignment of the individuals in a simple random sample of size 
n. The numbers, $n_1, n_2, \ldots, n_k$, in the different categories now follow the 
natural extension of the hypergeometric distribution of Section 2.11, but again 
if the $n_i$ are small in relation to the population totals $N_1, N_2, \ldots, N_k$ we can 
reasonably approximate this distribution. In this case we would use the 
multinomial distribution 

$$p(n_1, n_2, \ldots, n_k) = \frac{n!}{n_1!\,n_2!\cdots n_k!}\,P_1^{n_1}P_2^{n_2}\cdots P_k^{n_k}$$

(with the constraints $\sum n_i = n$, $\sum N_i = N$ and $\sum P_i = 1$), in the same way that we 
used the binomial distribution, as the basis for drawing inferences about 
$P_1, P_2, \ldots, P_k$. 

Needless to say, if we are interested only in a single category i even when 
k > 2, then the results of Sections 2.10 to 2.12 are directly applicable. We have 
merely to put $P = P_i$ and $Q = 1 - P_i$. 

If we have several attributes A, B,... which are simultaneously of interest 
(each perhaps with more than two categories) we enter the more complicated 
realm of multivariate measures (and complex surveys: see Section 3.6). Some- 
times, but rarely, the attributes are independent and can be examined separ- 
ately. Usually they are dependent and this simple prospect does not arise: 
consider, for example, the sex and the extent of knowledge of first-aid, of the 
respondents to the Civil Defense survey of Section 1.1. 


Proportion or ratio 


Consider the Agriculture example of Section 1.1, concerning sales of fruit. 
Since fruit is a perishable product, some of it will not be sold because it is 
not of ‘merchantizable’ quality. Suppose we survey the weights of different 
fruit bought by greengrocers, and subsequently sold by them. It could be of 
interest to estimate the proportion of the fruit bought that is actually sold. 

This is quite different however to our earlier idea of a proportion—where 
each observation is unambiguously classified as being in, or not in, some 
category with regard to the value of a single measured variable. For each 
sampling unit (e.g. a greengrocer) we now measure two quantities: purchases 
and sales (Y and X). We are interested in the population measure of their 
ratio. If we define a new variable R = Y/X, then estimation of its population 
mean $\bar{R}$ is just a further example of estimating a population mean (but not a 
population proportion) from an s.r. sample of values of the derived single 
variable R, and the earlier methods apply. But we could proceed differently, 
seeking to estimate the population measure $\bar{Y}/\bar{X}$, perhaps using $\bar{y}/\bar{x}$. Now we 
are using the two quantities on each sampling unit. Our next topic for 
methodological study will be the estimation of ratios (in Chapter 4), after a 
more detailed excursion in Chapter 3 into some of the practical problems of 
conducting a sample survey. 


2.14 Exercises 


2.1 Two independent s.r. samples of sizes 200 and 450 were chosen one 
after the other (without replacement) from a population of 2400 students in 
a non-residential College. Each student was asked the distance (in miles) from 
the College that he or she lived. The sample means and variances were 


$$\bar{y}_1 = 5.14, \quad \bar{y}_2 = 4.90,$$

$$s_1^2 = 3.87, \quad s_2^2 = 4.00.$$

Calculate an approximate 99% confidence interval for the mean distance 
from the College that students live. 


2.2 Consider the ‘casual holidays’ problem described in Examples 2.3 and
2.5. From company computer records it might be easy to obtain precise 
information on the number of workmen who missed no workdays over the six 
months period of interest. Suppose that 49.82% of the workforce of 36 000 
missed no work. An s.r. sample of 500 men out of the remainder yielded the 
following results. 


Days off      1    2    3    4    5    6    7    8    9   10
No. of men  157  192   90   31   18    5    2    4    0    1


Estimate the total number of days ‘casual holiday’ taken over the six month 
period; and determine the approximate standard error of the estimator. Do 
the same calculations for the s.r. sample of size 1000 described in Example 
2.3, and explain the discrepancies in the approximate standard errors in the 
two cases. 


2.3 In a private library the books are kept on 130 shelves of similar size.
The numbers of books on 15 shelves picked at random were found to be 


28, 23, 25, 33, 31, 18, 22, 29, 30, 22, 26, 20, 21, 28, 25. 


Estimate the total number, $Y_T$, of books in the library, and calculate an
approximate 95% confidence interval for $Y_T$.

Suppose the resulting estimate is not accurate enough; we want to be 95%
sure that an s.r. sample estimate of $Y_T$ is within 100 of the true value. How
many shelves should be included in the sample? 


2.4 An s.r. sample of size 2n is chosen from a finite population of size
N (N > 2n). The population mean and variance are $\bar{Y}$ and $S^2$, respectively.
The sample is divided into two equal parts: the first n observations, and the
second n observations. The sub-sample means are $\bar{y}_1$ and $\bar{y}_2$. Derive a simple
unbiased estimator of $S^2$ based on n, $\bar{y}_1$, and $\bar{y}_2$.


2.5 A residential area has 5000 private houses. We want to estimate the 
proportions of houses with 

(a) more than three persons living in them. 

(b) more than one car owned by the occupants of the house. 

The estimators are required to have standard errors not exceeding 0.02 and 
0.01, respectively. From other surveys it would appear that the proportions, 
for (a) and (b), will lie in the ranges 0.35 to 0.55 and 0.10 to 0.20, respectively. 
The two proportions are to be estimated from a single s.r. sample. How large 
a sample is needed to meet the accuracy requirements? 


2.6 We wish to conduct a survey on a sensitive issue, e.g. to estimate the 
proportion P of individuals who have ever used narcotic drugs. An s.r. sample 
of n individuals is chosen from a population of size N. Each is asked to 
answer, with respective probabilities a and 1 - a, either question A or question
B below, without revealing which question they have answered.

Question A: Have you ever used narcotic drugs? 

Question B: Is your birthday in April? 
Of the n sampled individuals, $n_1$ respond ‘Yes’. Assuming that one in twelve
of the population have birthdays in April, show how you might estimate P. 
Discuss the effect of different choices of the value of a, and explain any 
possible practical advantages of such a sampling scheme. Compare your result 
with the extreme (but more usual) case where a =1: that is to say where 
question B is not used. 


(This is an example of a ‘randomised response’ method, or of the ‘principle of 
the irrelevant question’). 


3 


Carrying out a sample survey 


Given that we have decided to conduct a sample survey or opinion poll on a
given theme, and of a given structure and size, by what means should we decide

• how to ask the questions in a manner which is clear and not misleading,

• how to encourage people to answer our questions (and to do so accurately),

• how to make contact with those we wish to survey,

• how to ensure that special-interest groups do not refuse to respond (and
possibly thereby bias the results),

• how to decide what it is ‘right and proper’ to ask,

• how to assess whether we are likely to achieve our statistical aims in terms
of unbiasedness and accuracy (or even cost),

and so on.

Such, often related, questions are just a few of the practical (not directly 
statistical) matters which we must consider when moving from the ‘drawing 
board’ to the ‘street corner’; indeed they also need to influence the final
‘drawing board’ (or design) decision on what type of survey, with what 
sampling scheme, and of what size is needed. 

Before extending our study of statistical design and analysis of surveys in 
subsequent chapters, it is useful to examine the range of practical problems 
that arise in respect of the points made above and to briefly consider how we
might be able to deal with them or minimise their effects. The material covered 
in this chapter could constitute a complete book in its own right. Indeed it 
has done so on different occasions, and particular examples which usefully 
serve to extend our brief study over most topic areas are Hoinville, Jowell, et 
al. (1978) and Moser and Kalton (1971). Selected references to wider discussion 
of particular topics are given in the appropriate sections. Some of these (such 
as questionnaire design, telephone and postal surveys, effects of non-response 
or errors in response, or complex surveys and variance estimation) have
also, as we shall note, received extended textual coverage, often as the subject
of a whole book.


3.1 Sources of variation and error 


This is the starting point: what can go wrong, and why? Suppose that we have 
made broad decisions on how much to spend on a survey and on its statistical 
form. Before embarking on the survey we must obviously see if it is feasible 
to collect the data in the way specified by the design and what difficulties may 
be encountered in the process. This pre-survey examination (more fully
described in the next section) will provide feed-back information from which 
we can if necessary mould the design into a feasible form. 

The areas in which difficulties might be encountered can in fact be readily 
categorised. A detailed coverage is given by Groves (1989). 

Suppose we are again interested in some variable Y taking values
$Y_1, Y_2, \ldots, Y_N$ over a finite population of size N, and in a characteristic of
the population expressed as a summary measure of Y, e.g. the mean $\bar{Y}$. To
know the whole population $Y_1, Y_2, \ldots, Y_N$ is to know everything about the
population. We have seen that this is an unrealistic prospect and that we must
seek instead to draw inferences (e.g. to estimate $\bar{Y}$) on the basis of a (probability)
sample $y_1, y_2, \ldots, y_n$ of size n < N where it is assumed that each observation
$y_i$ is a distinct member $Y_j$ of the population of Y-values.

Obviously we hope that the sample provides a reasonable reflection of the 
population: in particular that it reflects the natural variation in the set of 
Y-values. This variation, sometimes called sampling variation (or sampling 
error) is inescapable: households do have different income levels, industrial 
companies different quarterly sales figures. Why is this important? We have 
already seen that sampling variation (expressed perhaps in terms of the 
population variance, $S^2$) affects the accuracy with which we can, for a given
sample size, estimate a characteristic such as $\bar{Y}$. Thus an s.r. sample of size n
yields an estimator $\bar{y}$ with variance $(1 - f)S^2/n$; the smaller the sampling
variation as reflected by $S^2$ the smaller, of course, is this variance. In practice
we can estimate this sampling variance from our sample by $(1 - f)s^2/n$ for
post hoc assessment of accuracy of estimation.
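
As a minimal sketch of this post hoc calculation, the following Python fragment computes the s.r. sample mean together with its estimated variance, (1 - f) times the sample variance divided by n, where f = n/N is the sampling fraction. The function name and the data are hypothetical, chosen only for illustration.

```python
from statistics import mean, variance


def srs_mean_with_variance(sample, population_size):
    """Return (ybar, estimated variance of ybar) for an s.r. sample.

    The variance estimate is (1 - f) * s^2 / n with f = n / N and
    s^2 computed with divisor n - 1.
    """
    n = len(sample)
    if n < 2 or n > population_size:
        raise ValueError("need 2 <= n <= N")
    f = n / population_size
    ybar = mean(sample)
    s2 = variance(sample)  # statistics.variance uses divisor n - 1
    return ybar, (1 - f) * s2 / n


# Hypothetical sample of n = 4 values from a population of N = 40.
ybar, var_hat = srs_mean_with_variance([2.0, 4.0, 6.0, 8.0], 40)
print(ybar, var_hat, var_hat ** 0.5)  # mean, variance estimate, standard error
```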

But we might want to specifically design the survey to achieve a prescribed
degree of precision. This requires some advance knowledge of the value of $S^2$
and was discussed for simple random sampling in Sections 2.4 and 2.6 above.
We also noted in Section 2.6 some of the ways in which an advance estimate
of $S^2$ might be obtained to guide the choice of sample size, n, to achieve a
prescribed degree of precision. These included use of pilot studies, consideration
of the results of earlier ‘similar’ surveys, possible links between $S^2$ and
$\bar{Y}$ implied by the physical situation, and double sampling (or two-phase sampling)
in which our s.r. sample is made up of two s.r. samples, the first used to
estimate $S^2$ and hence to determine an appropriate overall sample size to
satisfy a given precision requirement (see Section 2.6). We shall consider pilot
studies in more detail later.
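
To make the role of the advance estimate concrete, here is a minimal Python sketch, assuming the precision requirement is stated as an upper bound V on the variance of the s.r. sample mean. Requiring (1 - n/N) times the population variance over n to be at most V gives n at least n0/(1 + n0/N), where n0 is the advance variance value divided by V. The function and argument names are hypothetical, and the advance value might come from a pilot study or from the first phase of a double sample.

```python
import math


def srs_sample_size(s2_advance, target_variance, population_size):
    """Smallest n with (1 - n/N) * S^2 / n <= V, given an advance value for S^2."""
    n0 = s2_advance / target_variance          # size needed if N were infinite
    n = n0 / (1 + n0 / population_size)        # finite-population adjustment
    return min(population_size, math.ceil(n))


# Hypothetical figures: advance variance 100, target variance 1, N = 1000.
n_needed = srs_sample_size(100.0, 1.0, 1000)
print(n_needed)
```

The result is only as reliable as the advance value supplied, which is why the caution about ‘similarity’ assumptions matters.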

‘Similarity’ and ‘model-structure’ assumptions need to be cautiously applied, 
as often the two situations being contrasted are not as similar in practice as 
might be expected (e.g. short-term changes in social attitude can markedly 
affect personal expenditure patterns) or the expected links do not exist. 

Thus it is important to be able to assess the extent of the natural sampling 
variation both for design of a survey and for interpretation of its results. 

It is interesting to consider an extension of the idea of double sampling, 
to that of replicated sampling in which a sample of size n=cm consists of 
c separate independent sub-samples each of size m and chosen according to
the sampling design of the overall survey. 

Suppose we choose to estimate a population characteristic $\theta$ by means of
an estimator $\hat{\theta}$ (e.g. estimating $\bar{Y}$ by $\bar{y}$). Obviously, we need also to determine
or estimate both the expected value and variance of $\hat{\theta}$ to assess how good it
is as an estimator of $\theta$. In the simplest case of estimating $\bar{Y}$ by $\bar{y}$ from an s.r.
sample then $E(\bar{y}) = \bar{Y}$ and $\mathrm{Var}(\bar{y}) = (1 - f)S^2/n$ so that all we need is an
estimate of the population variance $S^2$. But for more complex survey designs
and more complicated estimators we may not know the theoretical form of
$E(\hat{\theta})$ or $\mathrm{Var}(\hat{\theta})$.

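The replicated sample enables us to overcome this difficulty. Suppose
each sub-sample yields an estimate $\hat{\theta}_i$ (i = 1, 2, ..., c) and we put $\bar{\theta} = \sum_i \hat{\theta}_i/c$.
Then we can clearly (trivially) estimate $E(\bar{\theta})$ by $\bar{\theta}$ and an estimate of $\mathrm{Var}(\bar{\theta})$
is then directly provided by


$s^2(\bar{\theta}) = \sum_i (\hat{\theta}_i - \bar{\theta})^2 / [c(c - 1)].$    (3.1)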


The great advantage is that we can use this approach for any form of estimator 
and any sampling design. The choice of c is important and must balance the 
needs for 
(i) many sub-samples to ensure high precision [via (3.1)], and for 
(ii) sub-samples to be large enough to accommodate the sampling design 
(e.g. many strata: see Chapter 5). 
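
The replicated-sampling calculation can be sketched in a few lines of Python. This is an illustration only: the function name is our own, and the sub-sample estimates are hypothetical numbers standing in for whatever estimator and design the survey actually uses.

```python
def replicated_estimate(sub_estimates):
    """Combine c independent sub-sample estimates.

    Returns the average of the sub-sample estimates and the variance
    estimate sum((theta_i - average)^2) / (c * (c - 1)), as in (3.1).
    """
    c = len(sub_estimates)
    if c < 2:
        raise ValueError("at least two sub-samples are needed")
    theta_bar = sum(sub_estimates) / c
    sum_sq = sum((t - theta_bar) ** 2 for t in sub_estimates)
    return theta_bar, sum_sq / (c * (c - 1))


# Hypothetical estimates from c = 3 sub-samples.
theta_bar, var_hat = replicated_estimate([10.0, 12.0, 14.0])
print(theta_bar, var_hat)
```

With few sub-samples the variance estimate rests on only c - 1 degrees of freedom, which is the practical force of requirement (i).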

Jackknife methods extend this approach. Here we do not consider all c 
sub-samples of size m, but c overlapping samples of size (c - 1)m = n - m made
up of the whole sample dropping each of the sub-samples in turn. 
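
A minimal Python sketch of this ‘drop one sub-sample at a time’ scheme follows. The variance formula used, (c - 1)/c times the sum of squared deviations of the leave-one-out estimates about their average, is the usual delete-one-group jackknife form; it is not stated in the text above and is included here as an assumption. The function name and the data are hypothetical.

```python
from statistics import mean


def grouped_jackknife(groups, estimator):
    """Delete-one-group jackknife over c sub-samples.

    `groups` is a list of c sub-samples (lists of observations) and
    `estimator` maps a pooled list of observations to an estimate.
    Returns the leave-one-out estimates and the jackknife variance
    (c - 1) / c * sum((theta_(i) - average)^2).
    """
    c = len(groups)
    if c < 2:
        raise ValueError("at least two sub-samples are needed")
    leave_one_out = []
    for i in range(c):
        # Pool every sub-sample except the i-th: (c - 1) * m observations.
        pooled = [v for j, g in enumerate(groups) if j != i for v in g]
        leave_one_out.append(estimator(pooled))
    centre = sum(leave_one_out) / c
    var_hat = (c - 1) / c * sum((t - centre) ** 2 for t in leave_one_out)
    return leave_one_out, var_hat


# Hypothetical data: c = 3 sub-samples each of size m = 2.
estimates, var_hat = grouped_jackknife([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], mean)
print(estimates, var_hat)
```

For a linear estimator such as the mean with equal-sized sub-samples, this reproduces the replicated-sampling variance estimate; the two differ for non-linear estimators such as ratios.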

Kalton (1977) gives an interesting review of such ideas: see also the compre- 
hensive coverage of variance estimation in complex surveys by Wolter (1985). 

Of course, natural variability in a population (sampling variation) should 
not be thought of as ‘error’. It is an intrinsic and unavoidable feature of the 
population. But errors can occur in the sense that we do not manage to achieve 
what the chosen sampling design requires. There are essentially three types of 
such error: 
• non-inclusion errors
• non-response errors
• observation errors.

Non-inclusion errors, or coverage errors, arise when it happens that specific 
members of the target population cannot possibly arise in the sample. For 
example, a city street-corner interview survey is held on a Saturday afternoon 
when an important football match is taking place. This would be likely to 
greatly distort the accessible population and (depending on context) could 
seriously bias the survey results. Players and fans would not be encountered— 
but neither would individuals confined to hospital or prison! Then again a 
telephone survey cannot possibly cover population members who do not 
possess a phone. Basically, non-inclusion errors arise because of a serious
mismatch of target population and sampling frame. It is tempting to say that 
such an obvious fault should not arise, but it is not always easy to anticipate 
all effects that might lead to the inaccessibility of some population members. 

Obviously the bias arising from non-coverage errors will be greatest when
there is a major distinction in the values $\theta_C$ and $\theta_{NC}$, respectively, of some
characteristic of interest, $\theta$, in the covered (accessible) part of the population
and in the non-covered part. If $|\theta_C - \theta_{NC}|$ is small, this implies that the
non-coverage criterion is not closely correlated with the value of the characteristic
and the effect on the bias of an estimator of $\theta$ will be less serious. For
example, the absence of the Saturday-afternoon football fans may have much 
more effect on the variable ‘age’ than on the variable ‘weight’. 

Non-response errors are, as the name suggests, errors that arise from the fact 
that population members may be included in the surveyed sample but do not 
yield a value of a variable of interest, Y. Often we are simultaneously observing 
many variables; non-response may occur on subsets of the variables (e.g. age 
and income are not declared) or on the complete set of variables (e.g. the 
questionnaire is not returned, in spite of reminders).

Non-response errors can arise for a variety of reasons related to the nature 
of the information sought (e.g. facts or opinions), to features of the individuals 
in the population (e.g. persons, administrative units, industrial components), 
and to the method (or even time) of seeking to collect the information (e.g.
interview, questionnaire, telephone or postal enquiry). Such factors are often
interrelated: a member of a university group may be particularly reluctant to
answer a personal question over the phone but might be more receptive to a
face-to-face enquiry.

Often the extent of (complete) non-response is regarded as a measure of the 
lack of success of a survey but this is a dubious principle. Non-response may 
reflect unwise decisions on how, where, when and in what manner to try to 
collect the information and to some extent it can thus be controlled by sensible 
choice of operational procedures. But it is also recognised that different subjects 
of enquiry and different methods of collecting data inevitably produce different 
levels of non-response. Furthermore the non-response rate may not relate 
directly to the extent of the error that is engendered by the non-response. The 
resulting loss of sample size will of course inflate the variance of estimators, 
but the degree of bias will depend on how typical (or atypical) is the non- 
responding group of the population as a whole. If the variable Y is strongly 
correlated with tendency for non-response (e.g. perhaps the more highly paid 
will be least inclined to reveal their incomes) we might expect to encounter 
serious bias as a result of non-response. We can illustrate these effects.

Suppose we seek an s.r. sample of size n, but obtain r responses and n - r
non-responses. We might model the population as consisting (in principle) of
R potential responders with characteristics $\bar{Y}_R$ and $S_R^2$ and N - R non-responders
with characteristics $\bar{Y}_{\bar{R}}$ and $S_{\bar{R}}^2$. If we could assume that the
responses comprise an s.r. sample of size r from the population of R potential
responders, then if $\bar{y}$ is the s.r. sample mean,

$E(\bar{y}) = \bar{Y}_R$

and the bias is thus $(\bar{Y}_R - \bar{Y}) = (N - R)(\bar{Y}_R - \bar{Y}_{\bar{R}})/N$. So the extent of the bias
depends both on the non-response rate (N - R)/N and on the difference
between $\bar{Y}_R$ and $\bar{Y}_{\bar{R}}$. Furthermore


$\mathrm{Var}(\bar{y}) = (1 - r/R)S_R^2/r$    (3.2)


and the accuracy of estimation also depends both on the non-response rate
and on the difference between $\bar{Y}_R$ and $\bar{Y}_{\bar{R}}$. In the most favourable case, where
non-response is unrelated to the variable Y, we have $\bar{Y}_R = \bar{Y}_{\bar{R}} = \bar{Y}$ and $S_R^2 = S^2$,
and we have to compare (3.2) with $(1 - n/N)S^2/n$. But n/N is likely to be of
similar order to r/R, so that (3.2) reflects an increase by a factor n/r. That is,
non-response has effectively just reduced the sample size from n to r and the
loss of accuracy is expressible in terms of the (observed) non-response rate.
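
The two effects can be illustrated numerically by a minimal Python sketch, under the responder/non-responder model just described. The bias of the responders' mean is the non-responding fraction (N - R)/N multiplied by the difference between the responder and non-responder means; the variance comparison assumes the most favourable case in which the two groups share the same variance. All function names and numerical values are hypothetical.

```python
def nonresponse_bias(N, R, mean_responders, mean_nonresponders):
    """Bias of the responders' mean as an estimator of the population mean."""
    return (N - R) / N * (mean_responders - mean_nonresponders)


def variance_ratio(n, r, N, R):
    """Ratio of (1 - r/R)/r to (1 - n/N)/n, i.e. the variance inflation
    when responders and non-responders have a common variance."""
    return ((1 - r / R) / r) / ((1 - n / N) / n)


# Hypothetical population: N = 1000 with R = 800 potential responders,
# responder mean 50 and non-responder mean 40.
bias = nonresponse_bias(1000, 800, 50.0, 40.0)

# Hypothetical survey: n = 100 sought, r = 80 obtained, so r/R = n/N.
inflation = variance_ratio(100, 80, 1000, 800)
print(bias, inflation)
```

Here the population mean is 48, so the responders' mean of 50 overstates it by 2 however large the sample, whereas the variance inflation of n/r = 1.25 could be offset simply by seeking a larger sample.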

In practice, of course, the most favourable situation is unlikely to prevail 
and the comparison becomes more complicated. The notion of the population 
divided into two distinct groups (strata) of responders and non-responders is 
an example of a stratified population (Chapter 5). 

Some commentators have attempted to classify types of non-response (see, 
for example, Chapter 13 of Cochran, 1977). 

The most difficult forms of non-response to cope with are personal non- 
responses due to refusal to co-operate (Cochran’s ‘hard core’); this is par- 
ticularly influenced by the enquiry method (e.g. postal or personal interview) 
or even the very way in which an enquiry is framed. We will consider this in 
more detail below. 

Failure to find selected sample members is another source of non-response 
and ‘follow-up’ or ‘call-back’ can help to remedy this. There is overlap here 
between non-response and non-coverage. Is the unlocated sample member 
really part of the sampling frame? This is partly a matter of definition. Consider 
the missing football fan in the city-centre, Saturday afternoon, street-corner 
survey! 

Then there is the located sample member for which the required information 
is just not available, possibly because it is not relevant to that individual. 
Consider a traffic survey recording numbers of different types of vehicle at 
different sites over an observation period. No lorries are observed at one site, 
but lorries are banned on that observed section of road. Do we record ‘no 
lorries’ or ‘non-response’ or what? This is a common type of problem: distin- 
guishing between a zero-response and a non-response. Suppose a question 
‘How many hours television did you watch last week?’ elicits a dash(—). Does 
this mean ‘I didn’t watch TV last week’ or ‘I’m not telling you!’ or ‘I do not 
have a television’? This problem is particularly prevalent on self-completed questionnaires.
Failure to enter an answer is always a source of possible ambiguity and careful 
survey design is essential to minimise its influence. 
Coverage errors and non-response errors are both examples of non-observational
errors: errors due to non-observation. Equally serious are observational
errors, where we obtain information from the chosen sample member but that 
information is faulty. This can happen in a variety of ways. A question may 
be misleading or wrongly expressed and lead to an incorrect response (inter- 
viewer error or question error). An answer, although correct, may be wrongly 
recorded (recording error), or subsequently wrongly coded (coding error) or 
wrongly entered into a data-base (transmission error). These errors are not the 
fault of the sample member. In contrast an individual may give an incorrect 
reply to even a well-posed question either deliberately to conceal information, 
or due to confusion not anticipated by the interviewer or survey analyst. Such 
response errors occur if questions 
• concern sensitive issues
• invite incrimination
• are over-detailed in structure
• are psychologically ‘loaded’.

We shall examine these matters more in discussing questionnaire design and 
methods of collecting data in Sections 3.3 and 3.4 below. 

One further general class of errors needs to be considered. Termed measurement
error (or perhaps intrinsic error) it arises when there is a specific value
$Y_i$ relevant to the individual i but it is physically difficult to observe without
an additional superimposed error. Suppose Y is the pulse-rate of a patient.
We measure it on the ith patient as 78. Does this mean $Y_i = 78$? Not really,
since there will be error arising from the inaccuracy of measurement, and due
to variation from moment to moment by natural (intrinsic) effects. Whilst in
principle we might think of a unique $Y_i$, in practice we observe a random
quantity: a random variable. The result we obtain might perhaps be usefully
thought of as an observation $y_i$ of a random variable with mean $Y_i$
so that


$y_i = Y_i + e_i$


where $e_i$ is the combined error effect of measurement and natural variation.
This is not an uncommon situation. Consider another example. In a traffic
survey we want to study average daily flow rates over a year from different
vehicle classes and types of road. Suppose $Y_{ijk}$ is the flow rate for vehicle type
i for the kth site of a type-j road. To observe $Y_{ijk}$ (which is uniquely defined)
we would have to sit at the roadside for the whole year. In practice, we will
take counts over a much shorter time and seek to estimate $Y_{ijk}$ from the
limited-time data. Thus we will use a value


$y_{ijk} = Y_{ijk} + e_{ijk}$


where $e_{ijk}$ reflects the estimation error.


We are now entering a branch of statistics which combines finite population 
methods with general random-variable-based statistical inference. For further 
study the treatment in Sections 13.8 et seq. of Cochran (1977) provides a useful
introduction. See also Groves (1989). 

To summarise the above description of sources of error we have the following 
categorisation. 


General type                Forms                    Specific versions

Sampling variation          sampling error

Non-observational errors    non-inclusion error
                            (coverage error)

                            non-response error

Observational errors        response error           interview error,
                                                     question error

                                                     recording error,
                                                     coding error

                            measurement error        inaccuracy of measurement,
                            (intrinsic error)        natural variation


3.2 Pre-survey sampling 


Many of the practical difficulties of conducting a designed survey can be 
assessed and evaluated by a modest amount of pre-survey sampling. The data 
collected at this stage can help to determine a number of critical factors, such 
as 

• potential sources of measurement error

• likely non-response rates

• sensitive issues or sources of ambiguity

• interviewer inconsistencies

• difficulties of access to chosen sample members

• extent of variability (or some other characteristic) of some variable of
principal (or secondary) interest.

No simple prescription can be given for the required extent of, or the method 
of collecting, such preliminary information other than to stress the need for 
appropriate randomisation and for avoiding obvious sources of unrepresenta- 
tivity. How much pre-survey sampling is needed will depend on the range of 
problems which need preliminary investigation, as well as how much time and 
money is available for it. 

Usually, pre-survey data are used to ‘polish’ the sampling design (e.g. choice 
of sample size) and its method of operation (e.g. modification of questions, 
instruction of interviewers, need to allow for various follow-up or ‘reminder’ 
stages). These data are not usually included as part of the main survey data 
to be analysed, although an exception might arise in the case of double
(two-phase) sampling (or replicated sampling).

Preliminary work and pilot studies

The principal tool in pre-survey work is the pilot study. This can take many 
forms depending on its purpose. 

At one extreme we may need to select just a few individuals on whom to 
try out different approaches to seeking information. This is particularly impor- 
tant on sensitive (personal) issues or where topics are complex and likely to 
be misinterpreted. Extended, unstructured and in-depth interviews with 
individuals or groups can provide useful preliminary information on which
to design a questionnaire or ‘approach procedure’ for the main survey,
perhaps even to indicate the potential relative merits of the different ways of
seeking to obtain the data (see Section 3.3 below).

Provisionally chosen sets of questions and methods of sampling and access 
then need ‘polishing’ by means of ‘pre-tests’ of distinct aspects of the survey. 
For example if we are to conduct a survey on attitudes to local government 
spending based on face-to-face interviewing, we might need to test a section 
concerned with expenditure on ethnic issues. Will individuals understand the 
questions and interpret them properly; will interviewers be able to communicate 
adequately with the respondents; will different interviewers need different 
amounts of time? 

Such matters of unstructured design work and pre-tests are discussed in 
more detail by Hoinville et al. (1978, Chapter 2) and Moser and Kalton (1971, 
Chapter 2). 

At the opposite extreme we may need to conduct a minor version of the 
whole survey. The pilot study now takes the form of a pilot survey. Such pilot 
surveys, often based on random selection and possibly consisting of from 50 
to (even) 500 individuals, can be very important for the successful operation 
of a sample survey or opinion poll. A pilot survey can be used for prior 
estimation of possible response levels, and to elucidate the potential form of 
response errors. In turn this enables more informed decisions to be made 
about the necessary sample size for the main survey and about the likely need 
for follow-up enquiries. It can also highlight the ways in which non-response 
or response error might be related to population features and could thus lead 
to biased and unrepresentative results. It enables useful preliminary com- 
parisons to be made. For example, different covering letters could be compared 
in a postal survey or even different methods of data collection. In the civil 
defence survey of Section 1.1, a pilot survey was used not only to examine 
non-response bias but to compare the cost efficiency of a postal survey with 
a more structured ‘drop-and-post-back’ approach (see Section 3.3 below). 

The pilot survey can also be used to obtain advance estimates of important 
summary characteristics of the population, such as the variance of a crucial 
measure (see, again, Section 2.6). 

Other uses relate to more sophisticated sampling procedures that we shall
be considering later, including ratio and regression estimators and stratified
sampling. In the former case, as well as sampling observations of Y, we
observe a correlated (concomitant) variable X.
Particular advantages in terms of efficiency of estimation of $\bar{Y}$, say, can arise
if we know the population value, $\bar{X}$. As an extreme example of a pilot survey
we might be able to observe all population values of X, at modest cost, and
hence evaluate $\bar{X}$ precisely.

Consider an industrial survey in which we are interested in an advance 
estimate of the mean level of sales $\bar{Y}$ in a particular sector for a financial
accounting period just ended. Suppose full sales figures have to be returned
in due course and are listed (as values of X) for the previous accounting
period. It could be easy and cheap to use all these (as our ‘pilot survey’) to
derive $\bar{X}$, and hence to enable a ratio estimate of $\bar{Y}$ to be obtained from a
sample survey of Y-values in advance of the reporting stage. 
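
A minimal Python sketch of such a calculation is given below. It assumes the standard form of the ratio estimate of a mean, namely the sample ratio of means multiplied by the known population mean of X; this form anticipates Chapter 4 and is not derived here. The function name and all sales figures are hypothetical.

```python
def ratio_estimate_of_mean(y_sample, x_sample, x_population_mean):
    """Ratio estimate of the population mean of Y: (ybar / xbar) * Xbar.

    y_sample and x_sample hold the current and previous-period values
    for the same sampled units; Xbar is known from the full listing.
    """
    if len(y_sample) != len(x_sample) or not y_sample:
        raise ValueError("need paired, non-empty samples")
    # ybar / xbar equals sum(y) / sum(x) for paired samples.
    return sum(y_sample) / sum(x_sample) * x_population_mean


# Hypothetical sales (in thousands) for three sampled companies.
current_sales = [11.0, 22.0, 33.0]    # Y: period just ended
previous_sales = [10.0, 20.0, 30.0]   # X: previous period, same companies
estimate = ratio_estimate_of_mean(current_sales, previous_sales, 25.0)
print(estimate)
```

In this artificial case every company's sales rose by 10%, so the estimate is 10% above the known previous-period mean of 25.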

This is another example of double sampling (or two-phase sampling); here 
we take our survey data (at least as far as X is concerned) as a subsample of 
a previously chosen sample: albeit in this case the whole population. 

With stratification, a pilot survey can provide vital information on which to 
base decisions on how to divide up the population (into its strata), on the 
difficulties of sampling the different strata and even on what sample sizes are 
needed in each stratum in the overall survey. (See Chapter 2 of Moser and 
Kalton, 1971, or Chapter 12 of Cochran, 1977, for more detailed treatment of 
double sampling.) 

In all pre-survey activity, the notion of fieldwork is central: we must know
how data collection methods will work out in practice. Will interviewers be
able to achieve their objectives? Do they need regular supervision? Will it
be feasible to obtain access to minority components of the population? And so on.

Pre-survey methods also have a vital role to play in estimating the likely 
differential and overall costs of carrying out the main survey. 

Finally, the pre-survey stage is the one at which ethical factors are given 
proper consideration. These concern the relationships between the survey 
organisation and the respondents on the one hand, and the clients on the other.
Matters of confidentiality and intrusion are highly relevant as are the implica- 
tions and requirements of formal legislation such as the Data Protection Act 
in the UK. Cunliffe and Goldstein (1979) review a conference on this important 
topic carried out by the Royal Statistical Society. 

It is important to note how disaggregation may sometimes lead to a breach 
of confidentiality. If we disaggregate down to a level of fine detail even in a 
fairly large survey we may finish up with some cross-classifications containing 
only one or two sample members. Although not ‘named’ they can on occasions 
be readily identified from their presence in the particular cross-classification. 
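
A simple check for this risk can be sketched in Python: count the sample members in each cross-classification and flag the cells at or below some threshold. The threshold of two follows the ‘one or two sample members’ mentioned above; the function name, field names and records are hypothetical.

```python
from collections import Counter


def small_cells(records, keys, threshold=2):
    """Return the cross-classification cells with at most `threshold` members.

    `records` is a list of dicts and `keys` names the classifying fields.
    """
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return {cell: count for cell, count in counts.items() if count <= threshold}


# Hypothetical respondents; the fields are illustrative only.
respondents = [
    {"district": "North", "occupation": "teacher"},
    {"district": "North", "occupation": "teacher"},
    {"district": "North", "occupation": "teacher"},
    {"district": "South", "occupation": "teacher"},
    {"district": "South", "occupation": "surgeon"},
]
flagged = small_cells(respondents, ["district", "occupation"])
print(flagged)
```

Cells flagged in this way might be merged with neighbouring categories or suppressed before results are published.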


3.3 Methods of collecting the data


There are two distinct elements in the planning of a survey. One is the choice 
of statistical sampling design, whether it is just s.r. sampling or one of the 
more complicated schemes that we shall be considering later. The other is how 
we should carry out the designed survey: in the sense of collecting the data 
in accordance with the chosen sampling scheme. There are many possibilities, 
depending on the subject matter of the survey and the practical environment 
in which it will operate. We shall review the possibilities, covering specifically: 
• recorded information

• observation

• face-to-face interviewing

• postal enquiries

• telephone interviews


Use of recorded information 

The squirrel mentality is a feature of modern society! We collect everything, 
including information on all aspects of personal, scientific and societal action 
and interaction. This information is recorded and may be available at different
levels of detail, and with differing precautionary conditions, for the survey 
sampler to make use of. 

Such data may have been compiled for administration and (local or national) 
governmental purposes, e.g. lists of names and addresses of members of the 
population (as in electoral rolls), levels of trading activity of companies of 
different types, 10-year National Census returns, vehicle licensing details. More 
specifically and more personally, information is held in medical records, police 
files, income tax returns, credit-worthiness evaluations, etc. Special groups 
keep membership details: from sociologists, through solicitors to soroptimists.
Finally, individuals keep their own personal records: the scientist of his
research work (or, being human, of his bank account, letters to a close friend 
or of his collection of silver teaspoons!). 

In principle all such information could be useful—but will it be? Its use is 
limited by many factors. In particular, 

(i) it may not really coincide with the target population of our survey, 

(ii) it may not be up-to-date, 

(iii) it may be too aggregated to be useful, 

(iv) it may be inaccessible for formal or legal reasons, or merely to protect
the citizen (e.g. under the Data Protection Act),

(v) groups or individuals may not co-operate in providing details that are 
sought. 

Thus recorded information has limited value at the sampling-unit level 
beyond perhaps confirming (rather than providing) certain aspects of descrip- 
tive detail, e.g. sex, address. Occasionally its usefulness goes further, such as 
in an ‘internal’ survey where an organisation samples its own records to survey 
specific aspects of the characteristics of its members or internal structures. 

Where it can be particularly useful is in possibly providing a list or sampling 
frame on which to base the survey, or aggregate population details (sizes of 
sub-populations, orders of magnitude of means and variances of variables) to 
aid the design of the survey. 

In recent years there has been a growing tendency for organisations to pass 
on (sell) details of their recorded information to others for survey design 
purposes, e.g. a mail-order company may offer to sell a list of its customers 
and a promotional mailing-list to home-insulation companies. This is a practice 
that is not universally liked by the recipients of unsolicited mail! 


Observation 

It is natural to consider obtaining survey data merely by observing what is 
going on, without any need for communicative interaction. That is to say we 
look, rather than ask! This is a standard procedure for the scientist but may 
be relevant also to social or economic investigations. Although not widely 
used as a systematic principle for obtaining survey data, it has its role to play 
and we should briefly consider some advantages and disadvantages. 

The advantages include objectivity, accuracy and avoidance of response 
errors. 

Even in a social enquiry about the way in which people might react to new 
situations, direct observation can be useful. For example in a marketing survey, 
a company might introduce a new product in trial areas and directly observe 
through points of sale how it is received (rather than asking by means of 
interviews how customers would like the new ‘green’ soap powder if it was 
introduced). 

Again, scientific or technical data directly observed avoids the possible lack 
of understanding or knowledge on the part of respondents, e.g. as to whether 
their cars have servo-assisted brakes. Note that ‘observation’ might sometimes 
usefully be implemented (or augmented) by reference to published material. 
The car manuals will say if the brakes are servo-assisted, but they will not, of 
course, reveal the extent to which different models feature in the target 
population. 

Sometimes direct observation might be the only possible approach. Consider 
surveys on weed infestation in wheat crops, or abundance of species in rivers 
or ponds. We might need to go and look. Special approaches could be 
encountered in these examples. Quadrat sampling in biological surveys consists 
of randomly designating a sample area and counting what it contains, perhaps 
by throwing a standard one-metre square light wooden frame to define the 
sample region. Inverse sampling might be used for studying rare events: sampling 
at random until a predetermined number of events is obtained and noting 
the random number of trials needed to achieve this outcome. Capture-recapture 
methods involve taking a sample (e.g. of fish in a pond); marking them, releasing 
them and sampling again to observe how many marked individuals are 
obtained. 
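
As a minimal sketch of how the capture-recapture counts might be turned into an estimate of population size, the Python fragment below uses the classical Lincoln-Petersen ratio estimator. That estimator is not given in the text above; it is a standard choice, assumed here for illustration, and it presumes a closed population in which marked and unmarked individuals are equally likely to be caught. The function name and the figures for the pond are hypothetical.

```python
def lincoln_petersen(marked_first, caught_second, marked_in_second):
    """Estimate population size from a two-sample capture-recapture study.

    marked_first     : number caught, marked and released in the first sample
    caught_second    : size of the second sample
    marked_in_second : number in the second sample found to carry a mark
    """
    if marked_in_second <= 0:
        raise ValueError("no marked individuals recaptured; estimate undefined")
    # The proportion marked in the second sample estimates marked_first / N,
    # so N is estimated by marked_first * caught_second / marked_in_second.
    return marked_first * caught_second / marked_in_second


# Hypothetical pond: 100 fish marked, 80 caught later, 20 of those marked.
estimate = lincoln_petersen(100, 80, 20)
print(estimate)
```

The fewer marked individuals that reappear, the larger the estimated population, which is why very small recapture counts give unstable estimates.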

Direct observation can also greatly reduce the effects of non-response or of 
bias generated (often inadvertently) by the interviewer or responder. 

Disadvantages are obvious. Direct observation can be time-consuming and 
highly costly; it can also affect the nature of the very behaviour patterns we 
are hoping to observe. Also, it may not be possible. We can hardly expect to 
‘camp out’ in households to watch and identify traits, attitudes or behaviour 
patterns—although this is not an unknown feature of anthropological or 
sociological surveys, particularly in closed communities of people or animals. 


Face-to-face interviews 

This is a common means of gathering survey data, particularly in opinion polls 
and attitude surveys. Armed (usually) with an appropriate questionnaire which 
needs to be well constructed (see Section 3.4 below), interviewers go out into 
the population and complete the questionnaire on behalf of individuals ran- 
domly selected according to the survey sampling scheme. 

For successful and valid application this requires careful design of the 
questionnaire, fieldwork and evaluation by means of pre-survey work, engage- 
ment and training of interviewers (except for very small surveys), monitoring 
of their performance and assessment of the potential response of those inter- 
viewed to face-to-face encounter. (Of course, not all of these factors are unique 
to the personal interview approach.) 

It might be expected that this method will minimise misinterpretation and 
encourage high (if not full) response rate. A particular form is what is known 
as quota sampling (see also Section 5.6) where interviewers are instructed to 
constrain their ‘random choice’ of subjects by ensuring that specific numbers 
of particular categories of respondent are obtained. These constraints might 
be designed to achieve a particular stratified sampling scheme (as fully discussed 
in Chapter 5), e.g. a day’s task might require 20 male and 20 female subjects; 
in each case subdivided into 10 in each of social classes A and D, each of the 
10 including 6 of age up to 30 years and 4 older than 30 years. 
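
The quota scheme just described can be laid out as a table of cells. The following Python sketch (the variable names and the helper `still_needed` are illustrative, not part of any survey package) builds the eight cells and shows how an interviewer's running tally might be checked against them.

```python
from itertools import product

sexes = ["male", "female"]
classes = ["A", "D"]
age_quota = {"up to 30": 6, "over 30": 4}  # split of each group of 10

# One quota cell for each (sex, social class, age band) combination.
quotas = {
    (sex, cls, age): n
    for sex, cls in product(sexes, classes)
    for age, n in age_quota.items()
}


def still_needed(quotas, filled, cell):
    """Number of further respondents required in a given cell."""
    return quotas[cell] - filled.get(cell, 0)


total = sum(quotas.values())
print(total)  # the day's task of 40 interviews

# Example tally: three older class-D men interviewed so far.
filled = {("male", "D", "over 30"): 3}
print(still_needed(quotas, filled, ("male", "D", "over 30")))
```

Note that the table fixes only the number required in each cell; it says nothing about how individuals within a cell are to be chosen.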

The potential advantage is that this can reduce (or even avoid) non-response. 
With street-corner interviewing, subjects will be approached until quotas are 
filled. But this has its dangers! How are the subjects to be chosen ‘at random’? 
The interviewer has to decide if a potential respondent seems likely to be male 
and over-30 and class D! This implies a degree of subjectivity of choice, rather 
than random selection from the target population. At the end of a long hot 
day this can be particularly problematical. Then again the avoidance of non- 
response can itself engender response bias. How are we to be sure that those 
who agree to be interviewed do not have special attitudes that may be related 
to the very enquiries we are making? 

Other difficulties in personal interviewing include perceived exposure or 
vulnerability, as well as general resistance to being interviewed, or a ‘quiet-life’ 
response (‘yea-saying’) on the part of the potential respondents. Thus, ques- 
tions that may be acceptable in a private response situation may be resisted 
in a personal interview. Total refusal is also encountered ‘on principle’. There 
is also the temptation to give responses which are thought to be required, or 
to be safe, rather than being true. The psychology of the responder-interviewer 
interaction is most complex. 

An obvious feature of personal interviewing as a means of collecting survey 
data is its inevitable expense, through all stages of planning, training and 
implementation. It must be much more costly to send an interviewer out to 
ask someone questions, rather than looking up the information in records, 
sending out a questionnaire by post or making a ’phone call. The justification 
must rest in factors such as ease of access, reduction of non-response and 
misinterpretation and (perhaps) the speed with which information is obtained. 
These are perceived to be especially important in opinion polls, and market 
research. 

Needless to say, a personal-interview survey does not inevitably consist of 
casual street-corner encounters. A properly designed survey will often include 
a clear prior specification of the set of randomly chosen sample members to 
be interviewed with details of how to find them. This will involve determining 
a route and means of travel and a procedure to follow for those who are not 
found. Major cost implications enter into this aspect of the planning. In this 
mode, non-response cannot be avoided and call-back can be expensive. 

Another major problem is that of interviewer errors. The tendencies of 
respondents to give distorted answers (self-protection, self-esteem, ‘easy-life’) 
can, if we are not careful, be compounded by the interviewer. Ideally, the 
interviewer should not interact other than to provide neutral (non-subject- 
related) comments. This does not always happen. Biases can arise because of 
personal reaction of the respondent to the interviewer: prompting an ever 
greater desire to appear in a good light or, in contrast, a perverse intent to 
conceal or even mislead. There is also the risk of sheer misunderstanding by 
the interviewer of an answer that has been given. Sound and thorough training 
of interviewers is essential to minimise these difficulties. 

Chapter 12 of Moser and Kalton (1971) and Chapter 5 of Hoinville, Jowell 
et al. (1978) provide interesting further details on interviewing as a basis of 
collecting survey data. 


Postal surveys 

An obvious alternative to the personal interview, and probably the most 
frequently used method, is to send out a questionnaire by post with a request 
that it is completed and posted back to the survey operator. 

A hybrid approach of some interest is the ‘drop-and-post back’ method, 
where questionnaires are delivered personally and left for completion and postal 
return. This has the potential advantage that the personal contact might 
engender interest and co-operation, enable an opportunity to explain difficul- 
ties and identify inaccessible potential respondents; with associated possible 
gains in response-rate and reliability. Of course, it is again an expensive 
approach (almost as expensive as personal interviewing) and only occasionally 
proves to be cost-effective. 

With a postal or mail survey, all the usual design and planning procedures 
need to be followed (except usually for those involving interviewers). There 
may be an initial temptation for the recipient to throw the questionnaire ‘in 
the bin’ and efforts are needed to reduce this prospect. Thus, a well-constructed 
and persuasive cover letter can be useful (promising confidentiality), a stam- 
ped-addressed envelope is almost essential (it’s hard to throw away an unused 
stamp) and tangible incentives may even be offered: ‘return this form to claim 
your ball-point pen (or to be entered in our raffle)’. 

An important class of surveys does not in principle need artificial means to 
encourage response. There are the many official surveys, where government 
(or other administrative or professional) agencies expect or even formally 
require their enquiries to be answered. Thus, companies may be required (if 
selected) to submit returns on their last-quarter trading to the relevant govern- 
ment department. But we should not imagine that this guarantees a full (or 
high) response rate. Government surveys seem in fact to encounter quite 
noticeable levels of non-response for various reasons (including perceived 
complexity of enquiry, or work load) and are regularly monitored and modified 
to seek to reduce this. 

Postal surveys can lead to response rates as low as 50% or less. They 
inevitably require ‘follow-ups’ with reminders to non-respondents and often a 
further questionnaire and stamped-addressed envelope in each case. Often 
two reminders are sent. An old ‘rule of thumb’ claimed similar response rates 
at each stage: e.g. 

Stage              Overall response level 
First mailing      40% 
First reminder     64%   (40% + 40% of 60%) 
Second reminder    78.4% (40% + 40% of 60% + 40% of 36%) 

But it is of course dangerous to generalise! In the civil defence survey of 
Section 1.8, 3000 questionnaires were sent out. The initial response level was 
54%. A reminder raised this to 72%. This is quite a high response rate for this 
type of enquiry and not really in line with the ‘rule of thumb’. 
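
The arithmetic of the ‘rule of thumb’ is easily reproduced. The Python sketch below (function and variable names are our own) accumulates the overall response when each mailing wins the same fraction of those still outstanding, and then works out what fraction of the outstanding non-respondents the reminder actually won in the civil defence survey.

```python
def cumulative_response(rate, mailings):
    """Overall response level after each mailing, assuming every mailing
    obtains the same fraction `rate` of those who have not yet replied."""
    levels = []
    outstanding = 1.0  # fraction of the sample that has not yet replied
    for _ in range(mailings):
        outstanding *= 1.0 - rate
        levels.append(1.0 - outstanding)
    return levels


# Rule of thumb with a 40% response at each stage.
rule_of_thumb = cumulative_response(0.40, 3)
print([round(100 * level, 1) for level in rule_of_thumb])

# Civil defence survey: 54% initially, 72% after one reminder.
first, after_reminder = 0.54, 0.72
reminder_rate = (after_reminder - first) / (1.0 - first)
print(round(100 * reminder_rate, 1))
```

In the civil defence survey the reminder secured about 39% of those who had not replied, against 54% at the first mailing, so the stage-by-stage rates were far from equal.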

Apart from reduction of costs, other advantages of postal surveys are the 
elimination of interviewer errors and the opportunity to cover more detailed 
issues (perhaps requiring consultation between individuals or scrutiny of 
documents). But they are not necessarily a particularly speedy means of 
collecting data. Several months can elapse in carrying out (say) two 
reminders. 

Chapter 11 of Moser and Kalton (1971) and Chapter 7 of Hoinville, Jowell 
et al. (1978) give interesting discussions of postal surveys. More detailed 
treatments are by Erdos (1970) and Dillman (1978). 


Telephone surveys 
With the rapid proliferation of telecommunications it is natural to think of 
trying to contact individuals (persons, organisations, institutions) in a survey 
‘over the ’phone’. This is not such a recent idea. Even in the 1930s, in the 
United States, political opinion polls were being conducted this way but with 
some notable failures. One problem was that only about one third of the 
households had a telephone. The effects are clear: low coverage of the population 
and high response bias because of the inevitable non-representativity, in 
terms of social and economic characteristics, of those households which had 
telephones. 

The situation is of course very different now in coverage terms, with (perhaps 
many) more than 90% of households having a telephone in most developed 
countries. It is not surprising, therefore, to find increasing use of the telephone 
to contact selected sample members in social surveys, opinion polls and market 
research. 

The potential advantages of the telephone survey are obvious. It will be fast 
and cheap. Disadvantages include the non-coverage point outlined above. 
Even with 90% coverage, this will clearly not be random. Households without 
telephones are likely to reflect certain social and economic characteristics 
which in turn will be under-represented in a telephone survey sampling frame. 
Thus we can find noticeable non-coverage bias and this needs to be evaluated 
and assessed in each case. 
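
To see how non-coverage bias might be evaluated, suppose (purely for illustration) that we had some knowledge of the mean of a survey variable among households with and without telephones. The population mean is the coverage-weighted average of the two, and the bias of the telephone-only mean follows directly. The figures in this Python sketch are invented.

```python
def noncoverage_bias(coverage, mean_covered, mean_not_covered):
    """Bias of the covered-group mean as an estimate of the population mean.

    coverage is the proportion of the population on the sampling frame.
    The result equals (1 - coverage) * (mean_covered - mean_not_covered).
    """
    population_mean = (coverage * mean_covered
                       + (1.0 - coverage) * mean_not_covered)
    return mean_covered - population_mean


# Hypothetical: 90% coverage; 50% of telephone households and 30% of
# non-telephone households hold some opinion of interest.
bias = noncoverage_bias(0.90, 0.50, 0.30)
print(round(bias, 3))
```

The bias vanishes only if coverage is complete or if the two groups do not differ on the variable concerned; high coverage alone does not remove it.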

Other difficulties echo those in the personal interview approach and relate 
to inter-personal effects between the telephone interviewer and respondent. 
This includes misinterpretation and misunderstanding of questions and 
answers, as well as emotional attitudes of protection, esteem or over-compliance 
(this latter often giving rise to what is seen to be the ‘easy’ answer, 
particularly in yes/no questions: phenomena known as ‘yea-saying’ and 
‘nay-saying’). As with all questionnaire-based enquiries it is essential that the 
questionnaire is designed to minimise such artificial responses (see Section 
3.4 below). Equally important is the training of the telephone interviewers, 
following careful and detailed fieldwork and pre-survey investigations. 

A feature which is probably more prevalent with this approach than with 
any other is the prospect of outright refusal to co-operate. People seem to 
particularly resent what they see as ‘invasion of their home’ by the telephone 
enquirer and special skills are needed on the part of the telephone interviewer 
to try to get past this first barrier. 

In spite of the problems, and because of the obvious time and cost advantage, 
ever-increasing attention is being given to telephone survey usage. This is 
reflected in an expanding literature, even with whole books devoted to the 
subject. An interesting recent example is Groves et al. (1988) which surveys 
the present situation in its various aspects. 


Other methods 
For particular types of enquiry, there are other methods of data collection. 

The keeping of diaries is a useful method when extensive information needs 
to be accurately compiled over a period of time. Typical examples are in 
household expenditure surveys and television (or radio) audience measurement. 
The disadvantages are the burden placed on the respondent and the expense 
of detailed monitoring and interaction. 

Another time effect manifests itself in what are known as panel surveys (see 
Kasprzyk et al., 1989), where the same set of sample members is maintained 
and observed at different points in time. The aim is to improve the accuracy 
with which we can examine the way in which a population changes over time: 
trade surveys or road-traffic surveys are typical examples for useful application. 

There is a basic distinction of approach to be drawn between what are called 
cross-sectional, and longitudinal, studies. In the former (a cross-sectional sur- 
vey) a single sample is considered at a fixed point in time (or over a fixed 
period of time) with the aim of estimating characteristics of the population at 
that time. In the latter (a longitudinal survey) similar enquiries are made at 
different times to examine the dynamic development of the members of the 
population. A longitudinal survey in which the same sample is used at each 
time is a panel survey, but there is no need for a longitudinal survey to have 
total coincidence of sample members at each stage. A diary survey is not 
usually longitudinal, e.g. a TV audience measurement survey is interested in 
the passage of time (a few weeks perhaps) as a basis for accumulating 
information rather than observing changes over time. A series of such cross- 
sectional surveys (repeated surveys) does of course show temporal development 
of population characteristics (e.g. average hours of sport watched per week). 
But compared with the case where the sample is fixed (and we then have a 
longitudinal panel survey) it does not enable us to so readily and accurately 
estimate population characteristics of change over time, e.g. saturation (or 
adulation) effects in the watching of soap operas! 
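
The gain from keeping the sample fixed can be indicated by a simple variance calculation. The Python sketch below rests on assumptions not made in the text: simple random samples of the same size n on each of two occasions, a common variance on both occasions, a correlation rho between an individual's two responses, and no finite-population correction. Under these assumptions the variance of the estimated change in the mean is 2σ²(1 − ρ)/n for a panel and 2σ²/n for two independent samples.

```python
def change_variance(sigma2, n, rho=0.0):
    """Variance of the difference between two sample means.

    sigma2 : common variance of the variable on each occasion
    n      : sample size on each occasion
    rho    : correlation between an individual's two responses
             (0 for independent samples, positive for a typical panel)
    """
    return 2.0 * sigma2 * (1.0 - rho) / n


# Hypothetical figures: variance 4, samples of 100, correlation 0.8.
independent = change_variance(4.0, 100)
panel = change_variance(4.0, 100, rho=0.8)
print(independent, panel)
```

With a correlation of 0.8 the panel gives a variance one fifth of that from independent samples, which is the sense in which change is estimated more accurately.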
3.4 Questionnaire design 


It will be clear from our discussions above that much of the success of a survey 
requiring objective and subjective responses from individuals and organisations 
rests on the skill with which a questionnaire has been constructed. So many 
factors can affect, and distort, the answers that are given. Misunderstanding is 
an obvious problem. Questions must be clear and unambiguous with response 
categories carefully chosen. This is easier said than done! Misunderstanding 
takes many forms. A question can just be too ‘technical’: the concepts are 
unfamiliar. It is easy to overestimate the vocabulary of the respondent. Consider 
the following question: 


Q. The government’s efforts to reduce inflation have not been successful, it should 
persevere with these methods. Do you: 

    Strongly agree?        Strongly disagree? 


This could be a disaster area! The responder may not know what ‘inflation’ 
means, or even ‘persevere’, which may be taken to mean ‘discontinue’! But so 
many other problems arise with this question. It states a political viewpoint. 
Suppose the responder thinks that inflation is being reduced; this could modify 
the response (in anger, in defence?) compared with the answer that might be 
given to a neutral version of the question statement; e.g. The government should 
continue with its efforts to reduce inflation. Here we see the prospect for attitudinal 
bias in the response. 

What about the prospect of a neutral response? The respondent is given no 
opportunity to ‘sit on the fence’ but must either agree or disagree. Do we want 
to force a directed response, or do we admit the valid prospect of no view 
either way? 

In the latter case, we need an additional (centrally placed) neutral category, 
appropriately labelled. What about the labels? Should they go from ‘strongly 
agree’ to ‘strongly disagree’ or vice-versa? And what about possible effects of 
the ‘?’ on each label, or even the different sizes of the response boxes? 

An interesting forced and false response example was encountered over 50 
years ago when responders were asked to give their views on which of a number 
of specific actions should be taken by the United States government in reaction 
to the international ‘Metallic Metals Act’. About three-quarters of the respon- 
dents expressed firm views for action in spite of the facts that no such Act had 
ever been passed or proposed and there was an explicit ‘don’t know’ option. 
Presumably this was a marked example of self-protection: individuals trying 
to hide their ignorance from the enquirer. 

Thus there are major influences on the likely reaction of respondents to 
questions. They may modify their answers to protect themselves or to comply 
with what they feel is expected, or even (occasionally) to deliberately confuse. 
We can seek to anticipate these effects and reduce them by the design of the 
questionnaire. The types of questions (see below) and format for answers can 
be influential. 

Personally sensitive questions present another problem area. If they are not 
central to the enquiry but required for classification of groups of responders 
we can sometimes use proxy questions. Rather than asking about social classes 
or income level, a few proxy questions on newspaper reading habits and 
location or type of residence can give a fairly clear estimate of these characteris- 
tics. This is not foolproof of course but may be adequate and enables the 
direct question to be avoided. If it cannot be, we have to use other methods. 
One is the randomised response approach. When first used it sought to estimate 
abortion rates by asking at random whether the person had had an abortion or 
was born in a particular month. The respondent knew that the enquirer did not 
know which question was being answered: minimising resistance to answering 
the question, avoiding legal reporting obligations but allowing for accurate 
estimation of the abortion rates (see Exercise 2.6 at the end of Chapter 2). 
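
A minimal sketch of the estimation step may help. We assume (the details are not given above) that each respondent answers the sensitive question with known probability p and otherwise answers the innocuous question, whose probability q of a ‘yes’ is also known, e.g. 1/12 for birth in a stated month. The overall ‘yes’ rate is then pπ + (1 − p)q, where π is the sensitive proportion, and this can be solved for π. The function name and figures in this Python fragment are hypothetical.

```python
def randomised_response_estimate(yes_count, n, p_sensitive, q_innocuous):
    """Estimate the proportion with the sensitive attribute.

    yes_count   : number of 'yes' answers observed
    n           : number of respondents
    p_sensitive : known probability of being directed to the sensitive question
    q_innocuous : known probability of 'yes' to the innocuous question
    """
    if not 0.0 < p_sensitive <= 1.0:
        raise ValueError("p_sensitive must lie in (0, 1]")
    yes_rate = yes_count / n
    # Solve yes_rate = p * pi + (1 - p) * q for pi.
    return (yes_rate - (1.0 - p_sensitive) * q_innocuous) / p_sensitive


# Hypothetical survey: 1200 respondents, 110 answer 'yes', a fair coin
# decides the question, and the innocuous question is birth in a given month.
estimate = randomised_response_estimate(110, 1200, 0.5, 1 / 12)
print(round(estimate, 3))
```

The price of the protection given to the respondent is a larger sampling variance than a direct question would yield, since only a fraction p of the answers bear on the sensitive item.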

Other major factors that can irrelevantly affect, and possibly distort, 
responses are: 

• the position of a question, on the questionnaire, and relative to other 
questions, 

• the very wording of the question, 

• the interest of the respondent in the topic, 

• need for memory in answering the question. 

In Kalton et al. (1978) we see some interesting examples of the influence 
of question wording on question response. One simple example conveys a 
number of signals. In a survey of attitudes to local planning matters, respon- 
dents were asked ‘Are you in favour of giving special priority to buses in the 
rush hour?’. 69% were in favour. In a repeated survey with different respondents 
those in favour dropped to 55% when the question was extended with the 
words ‘or should cars have just as much priority as buses?’. 

Questionnaire design is very important and is a major subject of research. 
Much has been written on it and the few remarks above can be usefully 
augmented by examining Chapter 3 of Hoinville, Jowell et al. (1978) or 
Chapter 10 of Groves (1989). An extended treatment in relation to attitude 
surveys is given by Schuman and Presser (1981). 


3.5 Initial processing of the data 


It is all very well to collect the survey data accurately in accordance with the 
chosen statistical (probabilistic) sampling scheme, but what should we do with 
it when it is to hand? The simple answer is that we need to construct the 
required estimates of population characteristics, together with estimates of 
their accuracy. With a modest-sized sample (a few hundred) and a few items 
of information, there is usually no difficulty. Most samples, however, arise 
from complex surveys, with large sample size and a multiplicity of inter-related 
measures: some qualitative or quantitative, some attitudinal or factual. Major 
statistical problems arise in estimating marginal and inter-relational 
characteristics and in estimating accuracies of estimates, e.g. their standard errors 
and correlations (see Section 3.6). Before this stage, however, there are more 
basic decisions to be made about the organisation and recording of the data 
for subsequent analysis with special attention nowadays to the requirements 
of computer systems and packages. 

These matters include dealing specifically with: 
• scaling 
• editing 
• coding 
• tabulation 
• missing data 


Coding and scaling 

At the planning stage of a survey, decisions will have to be made on how 
to record the collected data. There are basic considerations, such as whether 
to maintain the data as completed questionnaires to be checked and analysed 
‘by hand’ or whether to enter the results into a computer for detailed analysis. 
The latter is the more usual (and often the only feasible) approach. Data on 
questionnaires may be of two types: structured responses, to questions in which 
the respondent has to answer within a set of permitted categories (‘place a 
tick in the relevant box, or boxes’) or unstructured responses, to questions where 
a personal narrative response is invited (‘What do you think of the management 
structure?’). 

For data processing (checking, editing, analysis and presentation) it is 
important to code the responses in a convenient manner. With factual structured 
questions, and particularly for computer storage and analysis, it is convenient 
to use numeric codes. Thus to an enquiry about the respondent’s sex, we might 
code ‘Male’ and ‘Female’ as 0 and 1, respectively. Here the numbers have no 
significance; we could have used —2 and 17.1 but this would have been rather 
untidy. For a question on the terminal stage of education the permitted 
responses might be: 


secondary, or high-school 
college 
university (first degree) 
university (higher degree) 

We could code these responses 0, 1, 2 and 3, but the choice is not now an 
arbitrary one. There is a natural ordering which would make 0, 3, 1, 2 rather 
perverse and even 3, 2, 1, 0 perhaps a little counter-intuitive. 

If a question asks the respondent to state how many children there are in 
the family the coding is a natural one: 2 children is sensibly coded as 2 and 
the actual coded number is now meaningful, since it is the answer to the 
question. A choice might have to be made, of course, about grouping the 
responses. If we retain categories 0, 1, 2 and ‘3 or more’, and code these as 
0, 1, 2 and 3, say, then the coding is no longer fully numerically interpretable. 
Consider, for example, estimating the mean number of children by naively 
averaging the coded responses over all respondents. 
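
The point is easily demonstrated. In the Python sketch below the family sizes are invented; the grouped coding replaces every count of 3 or more by the code 3, and the naive average of the codes then understates the true mean.

```python
# Hypothetical numbers of children reported by eight respondents.
true_counts = [0, 1, 2, 2, 3, 5, 4, 0]

# Grouped coding: categories 0, 1, 2 and '3 or more', coded 0, 1, 2 and 3.
coded = [min(count, 3) for count in true_counts]

true_mean = sum(true_counts) / len(true_counts)
naive_mean = sum(coded) / len(coded)

print(true_mean, naive_mean)
```

The naive mean can never exceed the true mean, and the shortfall grows with the number of large families in the sample.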

With structured attitudinal questions, a (numerical) coding system will again 
be needed. It might be quite uninterpretable: no logical link between response 
and its coded number. Or it may need at least to reflect a natural order in that 
there is a valid sense in which progression through the coded numbers expresses 
a progression of attitude, e.g. strongly disagree, disagree, undecided, agree, 
strongly agree. 

With unstructured questions, or those with large numbers of permitted 
distinct (and even multiple) responses, it may be necessary to determine coding 
schemes post hoc: deciding from the pattern of responses what might constitute 
a manageable set of response categories and a coding system for them. 
Obviously enough detail (i.e. a large enough number of categories) must be 
retained to reflect important areas of enquiry. It is unlikely, for example, that 
we could usefully contract ‘number of children’ into 

0: No children 
1: Some children 


Such considerations should make it clear that there are two possible reasons 
for coding. The first is administrative convenience in processing the data, 
particularly by computer. The second is the possible interpretability of the 
numerically coded responses from the point of view of analysis. Even ‘ordered’ 
responses give some advantage in this respect. If the numbers are fully interpret- 
able, e.g. the number of children, last year’s salary, sales last quarter (even if 
grouped into a number of categories in terms, say, of mid-values either at 
the response or subsequent coding stage) then clearly they are particularly 
useful for statistical analysis purposes. 

The ability to conduct meaningful statistical analyses is one of the stimuli 
to scaling of data. Its object is to transform a response (or a combination of 
responses to some related questions) to a numerical coding which is interpret- 
able and will justify formal statistical analysis. It presupposes that there is a 
natural scale of measurement on which the qualitative responses (or sets of 
responses) lie, and seeks to determine what that scale is. This is a complex 
field of study and will not be pursued here (but see Chapter 14 of Moser and 
Kalton, 1971) other than to give a simple example of the problem. It might 
seem reasonable to code the responses: 


strongly disagree, disagree, undecided, agree, strongly agree 


as 0, 1, 2, 3 and 4. At least there is an implied ordering preserved by this. But 
can we validly process such numerically coded responses in a statistical 
analysis? Does it make sense to say: ‘the mean response is 2.7?’ This depends 
on the underlying natural scale. To use the numbers 0, 1, 2, 3 and 4 makes 
highly specific assumptions about the relative values of different responses. 
Perhaps 

disagree, undecided, agree 

are really rather close together, in which case 

0, 2, 3, 4, 6 

might be a much more reasonable scaling and coding system! 
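
How much the choice of scale matters can be seen by computing the ‘mean response’ for the same set of answers under the two codings. The response counts in this Python sketch are invented for illustration.

```python
labels = ["strongly disagree", "disagree", "undecided", "agree",
          "strongly agree"]

equal_spacing = dict(zip(labels, [0, 1, 2, 3, 4]))
close_middle = dict(zip(labels, [0, 2, 3, 4, 6]))

# Hypothetical numbers of respondents giving each answer.
counts = {
    "strongly disagree": 10,
    "disagree": 5,
    "undecided": 5,
    "agree": 10,
    "strongly agree": 20,
}


def mean_score(counts, scale):
    """Mean coded response under a given numerical scale."""
    total = sum(counts.values())
    return sum(scale[label] * n for label, n in counts.items()) / total


print(mean_score(counts, equal_spacing))
print(mean_score(counts, close_middle))
```

The two scales have different ranges, so the means are not directly comparable; the point is simply that any such summary depends on the scale assumed and not only on the responses.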


Editing 

When the data are available there are many processes of checking and editing 
that will need to be conducted. These include the inevitable, vital and time- 
consuming checking for completeness of response: is there an answer to each 
question on a questionnaire? If not, what should be done? Missing data is a 
serious problem and we shall return to it shortly. A fairly common difficulty 
is in interpreting whether non-responses imply no view, inapplicability, 
refusal to answer, or failure to ask the question. Apart from incomplete 
responses, the editing process must seek to highlight any regular (or frequent) 
misinterpretations evident in the pattern of responses, and seek to identify 
specific inaccuracies, perhaps evidenced by contradictory answers, e.g. affirma- 
tion of both ‘no children’ and ‘oldest child is 10 years old or more’. But what 
about ‘no children’ and ‘oldest child is less than 10 years old’? 


Missing data 

Needless to say manual checking and editing can, and often does, go on at 
the same time as coding. But more routine editing is feasible and desirable 
when the data have been coded and entered into the computer. The computer 
can speedily conduct rapid and comprehensive searches for incompleteness 
and contradictions, including ‘out of range’ errors from coding, where a coded 
response is not included in the set for the particular question, and numerical 
responses which are statistically implausible in view of their extremeness 
(outliers). 
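
A minimal sketch of such routine computer checks is given below in Python. The question names, permitted codes and the particular contradiction tested are hypothetical; a real survey would derive them from its own code book.

```python
# Permitted codes for each structured question (hypothetical code book).
VALID_CODES = {
    "sex": {0, 1},
    "children": {0, 1, 2, 3},       # 3 stands for '3 or more'
}


def edit_checks(record):
    """Return a list of problems found in one coded questionnaire."""
    problems = []
    for question, allowed in VALID_CODES.items():
        value = record.get(question)
        if value is None:
            problems.append(f"{question}: missing")
        elif value not in allowed:
            problems.append(f"{question}: out of range ({value})")
    # Contradiction check: 'no children' with 'oldest child is 10 or more'.
    if record.get("children") == 0 and record.get("oldest_child_10_plus") == 1:
        problems.append("contradiction: no children but oldest child 10 or more")
    return problems


print(edit_checks({"sex": 1, "children": 2, "oldest_child_10_plus": 0}))
print(edit_checks({"sex": 7, "children": 0, "oldest_child_10_plus": 1}))
print(edit_checks({"children": 1}))
```

Checks of this kind only flag records for attention; deciding what to do with a flagged record remains a matter of judgement.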

Much attention has been given in recent years to automatically (statistically) 
compensating for faulty responses and missing observations by seeking to 
‘predict’ what might be reasonable ‘substitute responses’. This is particularly 
prevalent in large-scale educational and social surveys. Many methods have 
been used including the sophisticated ideas of ‘multiple imputation’ (Rubin, 
1987). There are differences of view on the routine use of such methods. Some 
argue that, if a large proportion of the data has to be dealt with, the process 
must be unreliable; whilst if only a small proportion requires attention it is 
probably not worth bothering! It is certain that much more will be heard of 
methods for remedying non-responses and response errors, to the long-term 
benefit of survey sampling. 


Tabulation 

Before leaving the field of data preparation we should note that almost any 
detailed survey analysis will need to be preceded by a simple tabulation 
(ideally computer-generated) of the survey results. This takes the form of 
frequency tables of the responses to each question, and a judicious choice of 
two-way cross-classifications. 

The aim is informal: to receive broad indications of effects and relationships 
to guide the more formal statistical analyses. This is particularly pertinent in 
the next section. 


3.6 Complex surveys 


These are surveys in which a multiplicity of factors is being investigated, in a 
multiply cross-stratified population reflecting intricate population structure, 
and where accordingly very complicated sampling designs are needed. Of 
prime importance will be the study of relationships and measures of their 
strengths. This may need the use of regression and multivariate methods not 
usually employed in finite-population sampling studies. Skinner et al. (1989)
provide details of such an approach. Another basic problem with complex
surveys is in estimating variances of estimators: a topic specifically and
extensively dealt with by Kalton (1977) and Wolter (1985), who includes modern
sample re-use procedures such as bootstrap and jackknife methods. A general
sampling method which has advantages, inter alia, for estimation of variances
is that of replicated sampling (or taking of interpenetrating samples). Here a
sample is made up of a set of samples each obtained from the same (possibly
complicated) sampling scheme. Thus we have replicated samples each reflecting
on (interpenetrating) the population, and intercomparisons can shed useful
light on the sampling behaviour of (possibly complicated) estimators. This
principle was illustrated in Section 3.1 above.
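As a rough illustration of how replicated samples can yield a variance estimate, the sketch below draws k independent replicates by the same scheme (here simply an s.r. sample of size m), applies the same estimator to each, and uses the spread of the k replicate estimates. The function name and arguments are hypothetical, and the simple s²/k form ignores any finite-population correction across replicates.

```python
import random
import statistics


def replicated_estimate(population, k, m, estimator, seed=0):
    """Draw k replicate samples of size m and combine the replicate estimates.

    Returns (overall estimate, estimated variance of the overall estimate,
    list of replicate estimates).
    """
    rng = random.Random(seed)
    # Each replicate uses the same sampling scheme: an s.r. sample of size m.
    estimates = [estimator(rng.sample(population, m)) for _ in range(k)]
    overall = statistics.mean(estimates)
    # Variability between replicates estimates the variance of their mean.
    var_of_overall = statistics.variance(estimates) / k
    return overall, var_of_overall, estimates


if __name__ == "__main__":
    pop = [float(v) for v in range(1, 101)]
    est, var_hat, reps = replicated_estimate(pop, k=10, m=8,
                                             estimator=statistics.mean, seed=1)
    print(est, var_hat)
```

The attraction is that the same recipe works however complicated the estimator is, since only the replicate values are needed.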


4 
Ratios: 
ratio and regression estimators 


We have so far considered the problems associated with the estimation of a 
single population characteristic, based solely on the probability sampling scheme 
of simple random sampling. Continuing with the same sampling scheme, we 
shall now broaden our enquiries a little with respect to the population charac- 
teristics of interest. Frequently (indeed predominantly) the aim of a sample 
survey is to seek information simultaneously on a range of different measures 
in the finite population we are studying. Both our practical interest and the 
cost and effort of conducting a survey demand that we should do so. 

For example, in the social affairs example (a) in Chapter 1, there will be a 
variety of different aspects of the attitudes of 18 year olds to recent legislative 
changes that will be of interest. Responses are likely to consist of completed 
questionnaires concerning such matters. 

Initial contact with the respondents is the major cost and administration 
factor. To restrict attention to a single question (‘Do you feel that the right to 
vote gives you an important say in the organisation of society?’) is both sterile 
and inefficient. (Note also that this example is a leading question which can 
prejudice the reaction of the respondent!) To seek answers to, say, 20 questions 
involves little more trouble than obtaining the answer to one question; it 
provides far greater facility for assessing attitudinal factors and can yield 
wide-ranging information on the population for current or future use. Thus 
we are often confronted with multivariate data concerning a variety of measures 
in the population, represented by variables Y, X, W,.... 

Simultaneous estimation of population characteristics exploiting the correla- 
tion structure of the multivariate population is not a principal part of this 
introductory treatment. However, one simple extension of the univariate situ- 
ation will be considered in detail in this chapter. This concerns the bivariate 
case where we simultaneously observe two variables, Y and X. We shall discuss 
two matters, distinct in aim but involving similar statistical considerations: 

(i) how to estimate the ratio of two population characteristics, for example
$Y_T/X_T$ or $\bar{Y}/\bar{X}$;

(ii) how simultaneous observation of Y and X, exploiting any association
between these variables, can in certain circumstances assist in the
efficient estimation of the characteristics of one of them, for example
$\bar{Y}$ or $Y_T$.


4.1 Estimating a ratio


In a variety of situations we may need to estimate a ratio of two population 
characteristics: the totals, or means, of two variables Y and X. We will be 
interested in the quantity 

$$R = Y_T/X_T = \bar{Y}/\bar{X},$$
which we will refer to as the population ratio. 

This interest can arise in two ways. Either the ratio is of intrinsic interest 
in its own right. For example, we may wish to estimate the proportion of 
arable land given over to the growth of barley in some geographic region. To 
this end we might sample farms in the region and record their total acreage, 
and the acreage of barley crops. If these are $X_i$ and $Y_i$ for the different farms
in the region, it is precisely $R = Y_T/X_T$ that we must estimate.

Alternatively, concern for the ratio R may arise from administrative con- 
venience in the construction of a viable sampling scheme. Suppose we want 
to estimate the average annual income per head, or average number of cars 
per person, for adult persons living in a particular geographic region. We 
might envisage taking a simple random sample of adult individuals, noting 
their income or the numbers of cars they possess (predominantly 0 or 1) and 
using the sample mean in each case to estimate the corresponding population 
mean of interest. But it might not be easy to sample adult individuals at 
random—ease of access to the population, and other quantities of interest in 
their own right, could favour the use of larger sampling units, say households. 
If this is the case, we become concerned with ratios, rather than means. The 
average income per head is now best regarded as the ratio of total income $Y_T$
to the total adult population size $X_T$, with both characteristics estimated from
the sample of households. Similarly for the car-ownership enquiry.

Note two features of this example: the use of groups of individuals (as 
sampling units) in the study of characteristics per individual (this relates to 
the idea of clustering, discussed in more detail later); also the simple nature 
of one of the variables in being discrete and taking only a few possible values 
(or even an indicator variable taking just the values 0 or 1). Both features 
commonly arise in ratio estimation (although the barley example shows that 
a simple discrete form for one of the variables is not inevitable). 

Thus we wish to estimate the population ratio $R = Y_T/X_T$, on the basis of
a simple random sample $(y_1, x_1), \ldots, (y_n, x_n)$ of the bivariate population
measures $(Y_i, X_i)$ $(i = 1, \ldots, N)$.

There are various possible approaches to estimation of R. Two immediately
obvious ones are to use the sample average ratio or the ratio of the sample
averages. Specifically, these are

$$r_1 = \frac{1}{n} \sum_{i=1}^{n} (y_i/x_i)$$

and

$$r_2 = \bar{y}/\bar{x} = y_T/x_T,$$

respectively. We must seek to examine some of the sampling properties of $r_1$
and $r_2$.

We often find that Y- and X-values of practical interest are correlated.
Suppose Y is household expenditure on food and housing, whilst X is total
household income. We must expect positive correlation between these two
measures. Furthermore, it seems clear that the presence or absence of such
correlations will affect the properties of the estimators $r_1$ and $r_2$. For example,
with higher positive correlation the individual ratios $Y_i/X_i$ will vary little
compared with a corresponding uncorrelated situation (with $S_Y^2$ and $S_X^2$
common in the two situations) and we might expect this to be reflected in the
precision of the estimators. No correlation, or negative correlation, however,
would seem to be unpropitious (both in terms of likely practical interest and
the properties of estimators of the population ratio).


(a) Let us start by considering $r_1$

In spite of its intuitive appeal, $r_1$ is not widely used as an estimator of the
population ratio R. It is biased and the bias and mean square error can be
large relative to other estimators, particularly $r_2$. We can readily calculate the
bias.

Consider the population of values $R_i = Y_i/X_i$. This has population mean $\bar{R}$
and variance $S_R^2$. Since $r_1$ is a sample mean from an s.r. sample, it has expected
value $\bar{R}$ and variance

$$\mathrm{Var}(r_1) = \left(1 - \frac{n}{N}\right) \frac{S_R^2}{n}.$$

But typically $\bar{R}$ is not the same as R, so we have

$$\mathrm{bias}(r_1) = \bar{R} - R = -\frac{1}{X_T} \sum_{i=1}^{N} R_i (X_i - \bar{X}). \tag{4.2}$$

This features the covariance $S_{RX} = \sum_{i=1}^{N} R_i (X_i - \bar{X})/(N-1)$ between R and X.
Thus, noting that the mean square error is the sum of the variance and the
square of the bias, we have

$$\mathrm{M.S.E.}(r_1) = \left(1 - \frac{n}{N}\right) \frac{S_R^2}{n} + \frac{(N-1)^2 S_{RX}^2}{X_T^2}.$$

We have the usual unbiased variance estimator $\sum_{i=1}^{n} (r_i - \bar{r})^2/(n-1)$, where
$r_i = y_i/x_i$, available for $S_R^2$ and we can readily obtain an unbiased estimator
of the covariance $S_{RX}$ in the form:

$$\sum_{i=1}^{n} r_i (x_i - \bar{x})/(n-1) = n(\bar{y} - \bar{r}\bar{x})/(n-1).$$

Hence (noting that $\bar{r} = r_1$) we can estimate bias($r_1$) and M.S.E.($r_1$) by means
of

$$-(N-1)\,n\,(\bar{y} - r_1\bar{x})/[(n-1)X_T]$$

and

$$\left(1 - \frac{n}{N}\right) \frac{\sum_{i=1}^{n} (r_i - r_1)^2}{n(n-1)} + \frac{(N-1)^2 n^2 (\bar{y} - r_1\bar{x})^2}{(n-1)^2 X_T^2}$$

respectively, provided $X_T$ is known (a condition which will prove to recur
throughout our study of ratio estimators, and which is not infrequently satisfied
in practice).

So if $X_T$ were known, we could actually correct $r_1$ for estimated bias, obtaining
a modified estimator

$$r_1' = r_1 + (N-1)\,n\,(\bar{y} - r_1\bar{x})/[(n-1)X_T]. \tag{4.3}$$

This is known as the Hartley-Ross estimator.
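The two basic estimators and the bias-corrected form (4.3) can be compared on a toy population. The sketch below is illustrative only: the five (Y, X) pairs are invented, and the helper `average_over_all_samples` is a hypothetical name. It averages each estimator over every possible s.r. sample, which gives its exact expectation and shows the correction removing the bias of the mean of ratios.

```python
from itertools import combinations


def mean_of_ratios(y, x):
    """r1: the sample average of the individual ratios y_i / x_i."""
    return sum(yi / xi for yi, xi in zip(y, x)) / len(y)


def ratio_of_means(y, x):
    """r2: the ratio of the sample averages (equivalently of the sample totals)."""
    return sum(y) / sum(x)


def hartley_ross(y, x, N, X_T):
    """r1 corrected for its estimated bias, as in (4.3); needs the known total X_T."""
    n = len(y)
    r1 = mean_of_ratios(y, x)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    return r1 + (N - 1) * n * (y_bar - r1 * x_bar) / ((n - 1) * X_T)


def average_over_all_samples(estimator, Y, X, n):
    """Exact expectation under s.r. sampling: average over all samples of size n."""
    samples = list(combinations(range(len(Y)), n))
    total = 0.0
    for idx in samples:
        total += estimator([Y[i] for i in idx], [X[i] for i in idx])
    return total / len(samples)


# Invented toy population, for illustration only.
Y = [2.0, 3.0, 7.0, 8.0, 12.0]
X = [1.0, 2.0, 3.0, 4.0, 5.0]
N, X_T = len(X), sum(X)
R = sum(Y) / sum(X)

if __name__ == "__main__":
    print("R            ", R)
    print("E(r1)        ", average_over_all_samples(mean_of_ratios, Y, X, 3))
    print("E(r2)        ", average_over_all_samples(ratio_of_means, Y, X, 3))
    print("E(corrected) ", average_over_all_samples(
        lambda y, x: hartley_ross(y, x, N, X_T), Y, X, 3))
```

Exhaustive enumeration is feasible only for very small N, but it avoids simulation noise when checking unbiasedness.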

Another way in which we could eliminate the bias is to sample (with
replacement) with probability proportional to the $X_i$ values, rather than using
s.r. sampling (see Section 4.5 below). Whilst this is not often possible in precise
form, we can sometimes devise a sampling procedure which effectively achieves
it: e.g. sampling from an electoral roll at random will tend to reflect different
roads with probabilities proportional to their numbers, $X_i$, of registered voters.

Note the effect of such sampling. We now have

$$E(r_1) = E(y/x) = \sum_{i=1}^{N} \frac{X_i}{X_T} \cdot \frac{Y_i}{X_i} = \frac{Y_T}{X_T} = R,$$

and thus $r_1$ is unbiased, without the need for any modification. More details
of this approach will be found in Section 6.4 below in the context of cluster
sampling.
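The unbiasedness claim for sampling with probability proportional to X can be checked exactly for a single draw: each unit i is selected with probability $X_i/X_T$ and contributes the ratio $Y_i/X_i$. The sketch below uses exact rational arithmetic; the function name and the integer toy values are hypothetical.

```python
from fractions import Fraction


def expected_ratio_under_pps(Y, X):
    """Exact E(y/x) for one draw made with probability proportional to X.

    Y and X are lists of positive integers so that the arithmetic stays exact.
    """
    X_T = sum(X)
    return sum(Fraction(x, X_T) * Fraction(y, x) for y, x in zip(Y, X))


if __name__ == "__main__":
    Y = [2, 3, 7, 8, 12]
    X = [1, 2, 3, 4, 5]
    # The selection probability cancels the denominator of each ratio,
    # leaving the sum of the Y values over X_T, that is R.
    print(expected_ratio_under_pps(Y, X), Fraction(sum(Y), sum(X)))
```

Since $r_1$ is then the mean of n independent draws of this kind, its expectation is the same value.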


(b) Let us now consider $r_2$.

This estimator is more widely used. Although still biased (and with a skew
distribution) in small samples, the bias and M.S.E. tend to be lower than for
$r_1$ (although precise claims in this respect are hard to justify and only rather
limited empirical studies have been made). In large samples the bias becomes
negligible and the distribution of $r_2$ tends to normality, thus enabling inferences
to be drawn based on a normal distribution with appropriate variance, Var($r_2$).

As with $r_1$, we have to deal with the complication that both numerator $\bar{y}$
and denominator $\bar{x}$ reflect random variation. Let us start again with the bias.
We have

$$r_2 - R = (\bar{y} - R\bar{x})/\bar{x}$$




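and taking a Taylor series expansion about the population mean $\bar{X}$ gives

$$r_2 - R = \frac{\bar{y} - R\bar{x}}{\bar{X}} \left(1 + \frac{\bar{x} - \bar{X}}{\bar{X}}\right)^{-1}
= \frac{\bar{y} - R\bar{x}}{\bar{X}} \left(1 - \frac{\bar{x} - \bar{X}}{\bar{X}} + \cdots\right).$$

As an approximation to the bias we can take the first two terms to obtain

$$E(r_2) - R \approx E\left(\frac{\bar{y} - R\bar{x}}{\bar{X}}\right) - \frac{1}{\bar{X}^2} E[(\bar{y} - R\bar{x})(\bar{x} - \bar{X})].$$

The leading term is zero since $E(\bar{y} - R\bar{x}) = \bar{Y} - R\bar{X} = 0$. Furthermore,

$$E[\bar{y}(\bar{x} - \bar{X})] = \mathrm{Cov}(\bar{y}, \bar{x}) = \left(1 - \frac{n}{N}\right) \frac{S_{YX}}{n} = \left(1 - \frac{n}{N}\right) \frac{\rho_{YX} S_Y S_X}{n},$$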

where $\rho_{YX}$ is the correlation between Y and X. Thus we find, as an approximation
to the bias,

$$E(r_2) - R \approx \frac{(1 - n/N)}{n\bar{X}^2} \left(R S_X^2 - \rho_{YX} S_Y S_X\right), \tag{4.4}$$

which can be small if $\rho_{YX}$ is close in value to $R S_X/S_Y$. This is equivalent to
saying that the regression of Y on X is linear and through the origin, or that
Y and X are roughly proportional to each other (see below for more detail
on this matter in a broader context).
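The step from the covariance result to the bias approximation (4.4) uses one further expectation that is not written out in the text. A sketch of the intermediate working, using only the s.r. sampling results for the variance and covariance of sample means, might run as follows.

$$
\begin{aligned}
E[(\bar{y} - R\bar{x})(\bar{x} - \bar{X})]
  &= E[\bar{y}(\bar{x} - \bar{X})] - R\,E[\bar{x}(\bar{x} - \bar{X})] \\
  &= \mathrm{Cov}(\bar{y}, \bar{x}) - R\,\mathrm{Var}(\bar{x}) \\
  &= \left(1 - \frac{n}{N}\right)\frac{1}{n}\left(\rho_{YX} S_Y S_X - R S_X^2\right),
\end{aligned}
$$

so that multiplying by $-1/\bar{X}^2$ reverses the sign and gives the form $(R S_X^2 - \rho_{YX} S_Y S_X)$ appearing in (4.4).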

Suppose we now consider large samples, utilising asymptotic results. We find
the following approximate results:

$$E(r_2) \approx \bar{Y}/\bar{X} = Y_T/X_T$$

and

$$\mathrm{Var}(r_2) \approx \frac{1-f}{n\bar{X}^2} \sum_{i=1}^{N} \frac{(Y_i - RX_i)^2}{N-1}, \tag{4.5}$$

where f is the sampling fraction n/N.

These approximate results may be obtained by writing as before

$$r_2 - R = \frac{\bar{y} - R\bar{x}}{\bar{x}}$$

and replacing $\bar{x}$ in the denominator by $\bar{X}$, which should be reasonable in large
samples in view of the consistency of the s.r. sample mean. We obtain

$$E(r_2 - R) \approx \frac{E(\bar{y}) - R\,E(\bar{x})}{\bar{X}} = \frac{\bar{Y} - R\bar{X}}{\bar{X}} = 0.$$

Also, on this approximation,

$$\mathrm{Var}(r_2) \approx E[(r_2 - R)^2] \approx \frac{1}{\bar{X}^2} E[(\bar{y} - R\bar{x})^2].$$


But if we define

$$Z_i = Y_i - RX_i,$$

then $(\bar{y} - R\bar{x})$ is just the mean, $\bar{z}$, of a simple random sample of size n chosen
from the population of $Z_i$ values.
Since this derived population has zero mean, we have from (2.3)

$$\mathrm{Var}(r_2) \approx (1-f) S_Z^2/(n\bar{X}^2),$$

where $S_Z^2$ is the variance of the population of Z-values.

Thus,

$$\mathrm{Var}(r_2) \approx \frac{1-f}{n\bar{X}^2} \sum_{i=1}^{N} \frac{Z_i^2}{N-1}
= \frac{1-f}{n\bar{X}^2} \sum_{i=1}^{N} \frac{(Y_i - RX_i)^2}{N-1},$$

as given in (4.5).


Equivalently, we can make use of standard results on the asymptotic form
of the mean and variance of the ratio of two statistics. We have that

$$E\left(\frac{\bar{y}}{\bar{x}}\right) = \frac{E(\bar{y})}{E(\bar{x})} + O(n^{-1}) \tag{4.6}$$

and

$$\mathrm{Var}\left(\frac{\bar{y}}{\bar{x}}\right) = \frac{\mathrm{Var}(\bar{y})}{[E(\bar{x})]^2} - \frac{2E(\bar{y})}{[E(\bar{x})]^3}\,\mathrm{Cov}(\bar{y}, \bar{x}) + \frac{[E(\bar{y})]^2}{[E(\bar{x})]^4}\,\mathrm{Var}(\bar{x}) + O(n^{-3/2}). \tag{4.7}$$

But

$$E(\bar{y})/E(\bar{x}) = \bar{Y}/\bar{X} = R,$$

so that from (4.6) we again demonstrate that $r_2$ is approximately unbiased in
large samples. The bias in $r_2$ is seen to be of order $n^{-1}$ as we have already
seen in (4.4). From the first-order terms in (4.7) we can again obtain the
approximation (4.5), using the fact that

$$\mathrm{Cov}(\bar{y}, \bar{x}) = \frac{1-f}{n} S_{YX},$$

where the covariance, $S_{YX}$, of the bivariate population of $(Y_i, X_i)$ values is
defined as

$$S_{YX} = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X}). \tag{4.8}$$


The variance of $r_2$ is of order $n^{-1}$ as we should expect, but we note that, in
general, the approximation is correct to order $n^{-3/2}$. If the bivariate finite
population manifests a roughly normal form, the error term in (4.5) is of order
$n^{-2}$ (in line with results for infinite bivariate normal populations), and the
approximation for Var($r_2$) is correspondingly more accurate.

But (4.7) also yields an alternative form for (4.5), namely the approximation

$$\mathrm{Var}(r_2) \approx \frac{1-f}{n\bar{X}^2} \left\{S_Y^2 - 2RS_{YX} + R^2 S_X^2\right\}, \tag{4.9}$$

which explicitly includes the population covariance, $S_{YX}$.

Further details on the adequacy of the approximate form for the variance
of $r_2$ are given in Cochran (1977, Chapter 6).

Once again we encounter the familiar problem that the variance of our
estimator is expressed in terms of population characteristics, which will be
unknown. Thus we will need to estimate Var($r_2$) from our data, and it is usual
to employ the direct sample analogue

$$s^2(r_2) = \frac{(1-f)}{n(n-1)\bar{x}^2} \sum_{i=1}^{n} (y_i - r_2 x_i)^2.$$


This differs from (4.5) by a term of order . 
The sum of squares

$$\sum_{i=1}^{n} (y_i - r_2 x_i)^2$$

is most conveniently calculated as

$$\sum_{i=1}^{n} y_i^2 - 2r_2 \sum_{i=1}^{n} y_i x_i + r_2^2 \sum_{i=1}^{n} x_i^2,$$

echoing the alternative form (4.9). Note how it is unnecessary to correct the
$y_i$ and $x_i$ values for their sample means, owing to compensation effects.

We remarked above that the exact distribution of $r_2$ is most complicated,
but that it approaches normality in large samples (when sampling from large
populations). Thus, for large samples, we can construct confidence intervals
for R. If $s(r_2)$ is the sample estimate of the standard error of $r_2$, $[\mathrm{Var}(r_2)]^{1/2}$,
we have an approximate $100(1-\alpha)\%$ symmetric two-sided confidence interval
for R in the form

$$r_2 - z_\alpha s(r_2) < R < r_2 + z_\alpha s(r_2).$$
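The calculation just described, from raw sample pairs to an approximate interval, can be sketched in a few lines. This is an illustrative sketch, not a prescribed routine: the function name is hypothetical, the sample mean of x is used in the denominator of the variance estimate (an assumption appropriate when the population mean of X is not known), and z = 1.96 corresponds to an approximate 95% interval.

```python
import math


def ratio_interval(y, x, N=None, z=1.96):
    """Estimate R by r2 and return (r2, standard error, (lower, upper)).

    N is the population size; if omitted the f.p.c. is ignored (f = 0).
    """
    n = len(y)
    f = 0.0 if N is None else n / N
    x_bar = sum(x) / n
    r2 = sum(y) / sum(x)

    # Computational form of the sum of squares of (y_i - r2 * x_i):
    # no correction for the sample means is required.
    sum_y2 = sum(yi * yi for yi in y)
    sum_yx = sum(yi * xi for yi, xi in zip(y, x))
    sum_x2 = sum(xi * xi for xi in x)
    ss = sum_y2 - 2.0 * r2 * sum_yx + r2 * r2 * sum_x2

    var_hat = (1.0 - f) * ss / (n * (n - 1) * x_bar ** 2)
    se = math.sqrt(max(var_hat, 0.0))  # guard against tiny negative rounding
    return r2, se, (r2 - z * se, r2 + z * se)


if __name__ == "__main__":
    print(ratio_interval([2.0, 4.0, 9.0], [1.0, 2.0, 3.0]))
```

The interval relies on the large-sample normal approximation, so for a sample as small as the one in this usage line it is no more than a numerical illustration.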


Example 4.1 


A daily newspaper conducts a survey of food costs by taking a
simple random sample of 48 basic foodstuffs purchased in a large
supermarket. Prices (in pence) for these items are recorded on two
separate occasions, three months apart, the earlier ones being
denoted $x_i$, the later $y_i$. The sample ratio, $r_2 = \bar{y}/\bar{x}$, gives an indication
of the change in basic food prices over the three months period in
the form of an estimate of the population ratio R of the mean
prices of all foodstuffs on the two occasions.
The following results were obtained:

$\bar{y} = 12.07$, $\bar{x} = 11.41$;
$\sum y_i^2 = 9270.6$, $\sum x_i^2 = 8431.7$, $\sum y_i x_i = 8564.1$.

Clearly the population size will be vast in relation to the sample
size n = 48, so that we can ignore the f.p.c.
So the estimated ratio is 1.06: a 6% rise in prices over the three
months in question. The approximate sampling variance of the
estimator is

$$\frac{587.0}{48 \times 47 \times (11.41)^2} = (0.0447)^2,$$

so that we have an approximate 95% confidence interval for R as
0.970 < R < 1.145. Clearly any firm statement of an average food
price increase would be unwise in the light of this approximate
interval (the wide range reflects the small sample size in the survey).
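The arithmetic of Example 4.1 can be retraced from the summary statistics alone. In the sketch below the three sums are read as 9270.6, 8431.7 and 8564.1; that reading is an assumption on our part, adopted because it reproduces the sum of squares 587.0 quoted in the example. The variable names are illustrative.

```python
import math

n = 48
y_bar, x_bar = 12.07, 11.41
sum_y2, sum_x2, sum_yx = 9270.6, 8431.7, 8564.1  # assumed reading of the sums

r2 = y_bar / x_bar                                 # ratio of the sample averages
ss = sum_y2 - 2.0 * r2 * sum_yx + r2 ** 2 * sum_x2  # sum of (y_i - r2 x_i)^2
var_hat = ss / (n * (n - 1) * x_bar ** 2)          # f.p.c. ignored
se = math.sqrt(var_hat)
lower, upper = r2 - 1.96 * se, r2 + 1.96 * se

if __name__ == "__main__":
    print(f"r2 = {r2:.3f}, sum of squares = {ss:.1f}, s.e. = {se:.4f}")
    print(f"approximate 95% interval: {lower:.3f} < R < {upper:.3f}")
```

Because the interval straddles 1, the data are compatible with no change in average prices, which is the caution the example draws.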


4.2 Ratio estimator of a population total or mean 


Suppose we wished to estimate the total local government expenditure in 1989 
on some particular service (health or education, say; let us in fact choose the 
provision of recreational facilities for children). We might decide to do this 
by sampling the different county and metropolitan authorities throughout the 
country, making specific enquiries on such expenditure in a simple random 
sample of these local authorities. 

Clearly there is going to be large variation in the amounts spent on rec- 
reational facilities by the different authorities. This will reflect many factors, 
including their land area, number of inhabitants, available budgetary resources, 
and rural, urban, and industrial breakdowns. Most of the variation in provision
stems of course from the differences in sheer size of the authority. We may
have available a deal of information on the relevant factors in the population.
If so, it would surely be desirable to make use of our knowledge of the structure 
of the population to assess the representativeness of any random sample we 
may draw, or to guide the choice of the sample in an attempt to obtain a more 
efficient estimator. 

We shall be considering at length in the next chapter the concept of 
stratification (the division of the population into non-overlapping groups, or 
strata, which represent its structure), and its use in the construction of what 
we hope will be better estimators than would be obtained from a simple random 
sample drawn from the unstratified population at large. 

At this stage, however, we will consider an alternative means of exploiting 
known elements of the population structure in certain circumstances. This 
consists of the use of ancillary quantitative information to construct what is 
called the ratio estimate of the population total (or mean). It seems reasonable, 
in the local authority example above, that expenditure on recreational facilities 
for children should change from one authority to another roughly in proportion 
to their number of inhabitants, or their total annual budgets. (Some slight 
anomalies may be observed in particularly small or large authorities, or for 
largely rural or industrial ones, but the pattern of proportionality overall is 
likely to appear quite strong.) 

Suppose $Y_i$ denotes expenditure on recreational facilities for authority i, $X_i$
its number of inhabitants, and we sample both measures simultaneously, at
random from the whole population, to obtain an s.r. sample of size n:
$(y_1, x_1), \ldots, (y_n, x_n)$. The total number of inhabitants for the whole population,
$X_T$, is likely to be known fairly accurately (for example, from census returns);
we will also know the number, N, of local authorities in the population. But
we could have estimated $X_T$ from the sample by means of the estimator

$$x_T = N\bar{x},$$

where $\bar{x}$ is the s.r. sample mean. Similarly we could estimate the total
expenditure $Y_T$ (the characteristic of principal interest) by

$$y_T = N\bar{y}.$$

The estimate $x_T$ has no interest in its own right (since we know $X_T$), but it
has the important advantage that by comparing it with the population
characteristic $X_T$ we can informally assess the representativeness of the sample. If
$x_T$ is very much less than $X_T$, then in view of the rough proportionality of $Y_i$
and $X_i$ we would conclude that $y_T$ is likely to underestimate $Y_T$; if $x_T$ is too
large, so is $y_T$ likely to be. If the proportionality relationship were exact we
would have

$$Y_i = RX_i \quad (i = 1, \ldots, N), \tag{4.10}$$

where R is the population ratio, $Y_T/X_T$ or $\bar{Y}/\bar{X}$, discussed in the previous
section. Thus,

$$Y_T = RX_T,$$

and we could estimate $Y_T$ by replacing R with the sample estimate, $r_2$, to
obtain an estimate of the population total, $Y_T$, in the form

$$y_{TR} = r_2 X_T = \frac{X_T}{x_T}\, y_T. \tag{4.11}$$


N.B. We shall not consider using the sample average ratio $r_1$ to estimate R
in this context, but will restrict attention to the ratio of the sample averages, $r_2$,
which henceforth will be denoted r (i.e. suppressing the subscript).

The estimator $y_{TR}$ is called the s.r. sample ratio estimator of the population
total. Note that this achieves precisely the type of compensation we require
for values of $x_T$ which are fortuitously larger, or smaller, than the known
value, $X_T$; it reduces, or increases, our estimate of $Y_T$ accordingly.

The exact case is discussed simply to motivate the estimator (4.11): if (4.10)
held, one observation $(y_i, x_i)$ would determine R precisely and hence
$Y_T$ $(= RX_T)$, so that ‘estimation’ of $Y_T$ is trivial.

If the exact relationship (4.10) does not hold (it is hardly likely to do so in 
any practical situation), the same aim at compensation must still be sensible 
whenever there is ‘rough proportionality’ between the variable of interest, Y, 
and the ancillary (concomitant) variable, X. In such cases we can again use 
the ratio estimator (4.11). 

If interest centres on the population mean $\bar{Y}$, rather than the total $Y_T$, then
similar arguments support the use of the ratio estimator of the population mean,

$$\bar{y}_R = r\bar{X} = \frac{\bar{X}}{\bar{x}}\, \bar{y}. \tag{4.12}$$

Such ratio estimators have an obvious appeal, but clearly we must attempt
to identify the circumstances under which we obtain an important improvement
in efficiency of estimation over the direct s.r. sample total, $y_T$, or mean, $\bar{y}$. This
must involve a clearer statement of what is meant by ‘rough proportionality’.
A statistically important factor in these enquiries is that the sole sample
statistic that is used is the sample ratio, r, whose properties have been discussed
in some detail in the previous section (as $r_2$: the ratio of the sample averages).

Consider the estimator $\bar{y}_R$. From (4.6), $\bar{y}_R$ is asymptotically unbiased; in
certain circumstances it is unbiased for all sample sizes, as we shall see shortly.
From (4.5) and (4.9) we see that the approximate variance of $\bar{y}_R$ (for large
samples) is

\begin{align}
\mathrm{Var}(\bar{y}_R) &\approx \frac{1-f}{n} \sum_{i=1}^{N} \frac{(Y_i - RX_i)^2}{N-1} \tag{4.13} \\
&= \frac{1-f}{n} \left(S_Y^2 - 2RS_{YX} + R^2 S_X^2\right) \notag \\
&= \frac{1-f}{n} \left(S_Y^2 - 2R\rho_{YX} S_Y S_X + R^2 S_X^2\right), \tag{4.14}
\end{align}

where

$$\rho_{YX} = \frac{S_{YX}}{S_Y S_X}$$

is defined to be the population correlation coefficient. If the exact relationship
(4.10) held, then Var($\bar{y}_R$) would of course be zero (by (4.13)). In practice this
will not be so, but Var($\bar{y}_R$) is clearly going to become smaller, the larger the
(positive) correlation between Y and X in the population.

For estimating $Y_T$ we have analogous results for $y_{TR}$. It is asymptotically
unbiased, and has large sample variance

$$\frac{N^2(1-f)}{n} \sum_{i=1}^{N} \frac{(Y_i - RX_i)^2}{N-1}$$

or

$$\frac{N^2(1-f)}{n} \left(S_Y^2 - 2R\rho_{YX} S_Y S_X + R^2 S_X^2\right).$$


Again we will need to estimate Var($\bar{y}_R$), or Var($y_{TR}$), from the sample, and
the most convenient forms to use are

$$\frac{1-f}{n(n-1)} \left(\sum_{i=1}^{n} y_i^2 - 2r \sum_{i=1}^{n} y_i x_i + r^2 \sum_{i=1}^{n} x_i^2\right)$$

and

$$\frac{(1-f)N^2}{n(n-1)} \left(\sum_{i=1}^{n} y_i^2 - 2r \sum_{i=1}^{n} y_i x_i + r^2 \sum_{i=1}^{n} x_i^2\right),$$

respectively. Note that this estimation stage must introduce further
inaccuracies; our only safeguard lies in the size of the sample.
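Putting the ratio estimators of the mean and total together with their estimated variances gives a short routine such as the one sketched below. The function name, argument names and returned dictionary keys are hypothetical; the population size N and the population mean of X are assumed known, as the method requires.

```python
def ratio_estimates(y, x, N, X_bar):
    """Ratio estimates of the population mean and total, with estimated variances.

    y, x  : paired observations from an s.r. sample
    N     : population size (assumed known)
    X_bar : known population mean of the auxiliary variable
    """
    n = len(y)
    f = n / N
    r = sum(y) / sum(x)  # ratio of the sample averages

    # Computational form of the sum of squares of (y_i - r * x_i).
    ss = (sum(yi * yi for yi in y)
          - 2.0 * r * sum(yi * xi for yi, xi in zip(y, x))
          + r * r * sum(xi * xi for xi in x))

    var_mean = (1.0 - f) * ss / (n * (n - 1))
    return {
        "mean": r * X_bar,               # ratio estimator of the mean
        "total": r * N * X_bar,          # ratio estimator of the total
        "var_mean": var_mean,            # estimated variance for the mean
        "var_total": N * N * var_mean,   # estimated variance for the total
    }


if __name__ == "__main__":
    print(ratio_estimates([2.0, 4.0, 9.0], [1.0, 2.0, 3.0], N=10, X_bar=2.2))
```

If the sample mean of x falls below the known population mean, the factor X_bar divided by the sample mean exceeds one and the estimate is scaled up, which is the compensation effect described in the text.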

Using the large sample forms for variances, and exploiting the asymptotic
normality of the estimators, we can again obtain approximate confidence
intervals for $\bar{Y}$ or $Y_T$ in the usual way. A reasonable practical prescription
for using the normal distribution, and the approximate form for the variance,
is that the sample size should be about 40, the sampling fraction no greater
than 0.25, and that the ratios $S_Y/\bar{Y}$ and $S_X/\bar{X}$ should both be less than 0.10.
These latter quantities are known as the population coefficients of variation for
the Y and X variables. We shall denote them by $C_Y$ and $C_X$, respectively.

When the large sample results are inappropriate, the assessment of the
properties of $\bar{y}_R$ and $y_{TR}$, and the construction of confidence intervals for $\bar{Y}$
and $Y_T$ using ratio estimators, are most complicated. The exact results are
incompletely known, and not very tractable. Some approximate results, which
take account of the fact that the distribution of r frequently has positive
skewness, are summarised by Cochran (1977, Chapter 6).

Let us return briefly to the question of the bias of the ratio estimators. The
model (4.10) for the relationship between the $Y_i$ and $X_i$ values in the population
was of little relevance apart from motivating the form of $\bar{y}_R$ or $y_{TR}$. We cannot
expect it to hold in practice; indeed if it did, there would be no estimation
problem! Relaxing the model slightly we might consider one for which

$$Y_i = RX_i + E_i, \tag{4.15}$$

with $\sum_x E_i = 0$, where $\sum_x$ denotes summation over all subscripts i for which
$X_i = x$. (This is the finite population analogue of the classical linear regression
model.)

In this case $\bar{Y} = R\bar{X}$ (as required from the definition of R), and in an s.r.
sample of size n,

$$r = \bar{y}/\bar{x} = R + \bar{e}/\bar{x},$$

where $\bar{e}$ is the sample mean of the E values in the sample.
Clearly the conditional expectation

$$E(\bar{e} \mid x_1, \ldots, x_n) = 0$$

for this model, so that E(r) = R and we conclude that r is unbiased for all
sample sizes. But whilst more plausible than (4.10), the model (4.15) is again
unlikely to be exactly satisfied by our finite population. At best we may find
that the population is ‘roughly’ of this form, and may consequently be less
concerned than otherwise about possible bias in r, $\bar{y}_R$, or $y_{TR}$.

We have examined in some detail the properties of ratio estimators of a 
population mean or total, but a crucial question remains. Under what circum- 
stances, if any, should we use a ratio estimator in preference to an s.r. sample 
mean or total? 

More effort is involved in obtaining $\bar{y}_R$ (or $y_{TR}$) than $\bar{y}$ (or $y_T$), albeit to
only a slight degree since the major task is in designing and conducting the
survey. Simultaneous measurement of two quantities, Y and X, often poses
little more work than measurement of one alone. We can thus rule out
differential costs as a major distinguishing factor in most circumstances.

The prime consideration becomes the precision of the estimation principle: is
$\bar{y}_R$ (or $y_{TR}$) more or less efficient than $\bar{y}$ (or $y_T$)? The answer turns out to be
that either possibility can arise, depending on the population correlation $\rho_{YX}$
and the coefficients of variation $C_Y$ and $C_X$.

We must identify the conditions under which Var($\bar{y}_R$) is less than Var($\bar{y}$);
that is, where the ratio estimator is the more efficient. From (2.3) and (4.14) we
see that

$$\mathrm{Var}(\bar{y}_R) < \mathrm{Var}(\bar{y})$$

if

$$R^2 S_X^2 < 2R\rho_{YX} S_Y S_X,$$

that is, if

$$\rho_{YX} > \frac{1}{2}\,\frac{C_X}{C_Y}. \tag{4.16}$$


So we see that a gain in efficiency is not in fact guaranteed; we need the
population correlation coefficient to be sufficiently large. (In practice we would
need to assess the criterion (4.16) from sample estimates of $\rho_{YX}$, $C_Y$, and $C_X$.)

But notice that however large $\rho_{YX}$ turns out to be, we still need not necessarily
obtain a more efficient estimator by using $\bar{y}_R$ (or $y_{TR}$). If

$$C_X > 2C_Y, \tag{4.17}$$

the ratio estimator $\bar{y}_R$ (or $y_{TR}$) cannot possibly be more efficient than $\bar{y}$ (or $y_T$)
even with essentially perfect correlation between the Y and X values. Thus
two factors are important for efficiency improvement from ratio estimators:
the variability of the auxiliary variable X must not be substantially greater
than that of Y (in the sense of (4.17)), and the correlation coefficient $\rho_{YX}$
must be large and positive.
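The efficiency criterion is easy to evaluate once the correlation and the two coefficients of variation are available (or estimated). The sketch below is illustrative, with hypothetical function names, and assumes a positive population ratio R. The second function gives the large-sample ratio of the two variances, which follows from dividing (4.14) by the variance of the s.r. sample mean: writing k for $C_X/C_Y$, the ratio is $1 - 2\rho k + k^2$.

```python
def ratio_estimator_more_efficient(rho, c_y, c_x):
    """Criterion (4.16): True if the ratio estimator has the smaller variance."""
    return rho > 0.5 * c_x / c_y


def relative_variance(rho, c_y, c_x):
    """Large-sample Var(ratio estimator) / Var(s.r. sample mean)."""
    k = c_x / c_y
    return 1.0 - 2.0 * rho * k + k * k


if __name__ == "__main__":
    # Illustrative values: strong correlation, similar coefficients of variation.
    print(ratio_estimator_more_efficient(0.825, 0.376, 0.408))
    print(relative_variance(0.825, 0.376, 0.408))
    # With C_X > 2 C_Y even perfect correlation cannot help, as in (4.17).
    print(ratio_estimator_more_efficient(1.0, 0.2, 0.5))
```

A relative variance below one means the ratio estimator is the more efficient; the two functions always agree, since $1 - 2\rho k + k^2 < 1$ exactly when $\rho > k/2$.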

Nonetheless, many practical situations are encountered where the appropriate
conditions hold and ratio estimators offer substantial improvement over $\bar{y}$ or $y_T$.

Reviewing the situation we need the following circumstances to hold:

(i) We must be able to observe simultaneously two variables Y and X
which appear to be roughly proportional to each other (that is, which
have high positive correlation).

(ii) The auxiliary variable X must not have a substantially greater coefficient
of variation than Y.

(iii) The population mean $\bar{X}$, or total $X_T$, must be known exactly.

The ‘rough proportionality’ in (i) implies a more or less linear relationship 
through the origin. The fact that this is through the origin has not been formally 
considered above. Its major importance is a negative one. If Y and X were 
essentially linearly related, but the relationship did not pass through the origin, 
then we might be well advised to consider an alternative estimator known as 
the regression estimator. This estimator is discussed in the next section where
it is contrasted in terms of usefulness and efficiency with the ratio estimator
and the s.r. sample mean.


Example 4.2 


Let us reconsider the Statistics Class data given in Section 1.6. For
this population, with Y denoting height and X denoting weight,
we clearly have quite a strong positive association between the $Y_i$
and $X_i$ values, as shown in the scatter diagram, Figure 4.1. Using
our privileged knowledge of the whole population we can calculate
various population characteristics of interest.

Fig. 4.1. Scatter diagram of weights (X) and heights (Y) in the Statistics Class

$S_X^2 = 72.61$, $S_{YX} = 63.07$, $S_Y^2$ = 80.?4, $\rho_{YX} = 0.825$;
$R = \bar{Y}/\bar{X} = 1.144$, $C_Y = 0.376$, $C_X = 0.408$.


Fig. 4.2. Histogram of 500 values of $\bar{y}_R$ from s.r. samples of size 5 in the Statistics Class

The ratio of the coefficients of variation is almost exactly unity; the
correlation coefficient very much in excess of half this value.
Although the results above refer to large samples we should perhaps
not be too surprised to find that the ratio estimator of $\bar{Y}$, based on
an s.r. sample of size 5, greatly improves on the s.r. sample mean.
To investigate this, 500 such s.r. samples have been generated and
$\bar{y}_R$ evaluated in each case. Figure 4.2 shows a histogram of the
values obtained. By comparison with Figure 2.1, we see that $\bar{y}_R$ is
far less dispersed than $\bar{y}$, and consequently more efficient. The mean and
variance of the 500 values of $\bar{y}_R$ are 23.86 and 4.45 respectively, in
comparison with 23.89 and 12.52 for the 500 values of $\bar{y}$. (The large
sample approximation to Var($\bar{y}_R$) is 3.81.)
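A sampling experiment of the kind described in Example 4.2 is straightforward to mimic. The Statistics Class data are not reproduced here, so the sketch below generates a synthetic population with Y roughly proportional to X; the population size, slope and noise level are all assumptions made purely for illustration, and the numerical results will not match those of the example.

```python
import random
import statistics


def simulate(N=30, n=5, reps=500, seed=42):
    """Compare the s.r. sample mean and the ratio estimator over repeated samples.

    Returns (true population mean of Y, variance of the sample means,
    variance of the ratio estimates).
    """
    rng = random.Random(seed)
    # Synthetic population: Y roughly proportional to X (assumed form).
    X = [rng.uniform(10.0, 40.0) for _ in range(N)]
    Y = [1.1 * x + rng.gauss(0.0, 3.0) for x in X]
    X_bar = statistics.mean(X)  # treated as known, as the method requires

    sample_means, ratio_estimates = [], []
    for _ in range(reps):
        idx = rng.sample(range(N), n)  # s.r. sample without replacement
        y = [Y[i] for i in idx]
        x = [X[i] for i in idx]
        y_bar = statistics.mean(y)
        sample_means.append(y_bar)
        ratio_estimates.append(y_bar / statistics.mean(x) * X_bar)

    return (statistics.mean(Y),
            statistics.variance(sample_means),
            statistics.variance(ratio_estimates))


if __name__ == "__main__":
    true_mean, var_mean, var_ratio = simulate()
    print(true_mean, var_mean, var_ratio)
```

With a population this close to proportional, the ratio estimates cluster far more tightly about the true mean than the sample means do, which is the pattern the histogram comparison in the example displays.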


4.3 Regression estimator of a population total or mean


Another type of estimator which aims at exploiting the relationship between
some variable of interest, Y, and an auxiliary variable, X, in order to obtain
greater precision in estimating $\bar{Y}$ or $Y_T$ is the so-called regression estimator.
This is a particularly useful estimator when (again) there is some degree of
linearity in the relationship between the Y and X values in the population,
but this relationship does not necessarily pass through the origin. An exact
relationship of this type would take the form

$$Y_i = \bar{Y} + B(X_i - \bar{X}) \tag{4.18}$$

for all population values $(Y_i, X_i)$ and some appropriate value of B. The
regression estimator, like the ratio estimator, is applied in situations where
the value of $\bar{X}$ is known. In the case of a population satisfying (4.18), we
could clearly determine $\bar{Y}$ exactly from a single observation (y, x), since

$$\bar{Y} = y + B(\bar{X} - x).$$


But of course no such precise structure is likely to be encountered in real-life 
problems. What is possible, however, is that the Y- and X-values do seem on 
inspection to vary in a way which reflects a degree of linearity, with relatively 
small superimposed deviations about the linear relationship. Thus, for example, 
we might consider a model in which 


Y,= Y+B(X;-X)+E,, (4.19) 


on the assumption that the E-values have zero population mean and bear no
systematic relationship to the X-values, and in the belief that the population
variance, $S_E^2$, of the $E_i$ will be rather small in relation to $S_Y^2$. The model (4.19)
is then a useful representation of a population where the variation in Y-values
may be attributed in part to a linear dependence on the corresponding X-values,
and in (perhaps lesser) part to population vagaries unconnected with the
X-values. If we assume (as suggested above) that $S_{XE} = 0$, then we see how
$S_Y^2$ has two components,

$$S_Y^2 = B^2 S_X^2 + S_E^2,$$

and the population correlation coefficient is

$$\rho_{YX} = \frac{S_{YX}}{S_Y S_X} = \frac{B S_X}{S_Y}.$$

So,

$$S_E^2 = S_Y^2(1 - \rho_{YX}^2),$$

and we note that the relative importance of the X- and E-values in accounting
for the variability in the Y-values depends on the value of $\rho_{YX}^2$. If the Y- and
X-values are highly correlated (in a positive or negative sense), the E-values
make little contribution, and vice versa. This mirrors the characteristics of the
classical linear regression model in the infinite population case.
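The decomposition can be checked numerically on a toy population. The sketch below is only an illustration: the values of X, E, B and $\bar{Y}$ are hypothetical, chosen so that the E-values have zero mean and zero covariance with the X-values, as model (4.19) with $S_{XE} = 0$ requires.

```python
from statistics import mean


def pop_var(values):
    """Finite-population variance with divisor N - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def pop_cov(a, b):
    """Finite-population covariance with divisor N - 1."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


# Hypothetical toy population (not taken from the text).
X = [1.0, 2.0, 3.0, 4.0]
E = [1.0, -1.0, -1.0, 1.0]   # mean zero and zero covariance with X
B, Y_bar = 2.0, 10.0
X_bar = mean(X)
Y = [Y_bar + B * (x - X_bar) + e for x, e in zip(X, E)]   # model (4.19)

S2_X, S2_E, S2_Y = pop_var(X), pop_var(E), pop_var(Y)
rho = pop_cov(Y, X) / (S2_Y * S2_X) ** 0.5

print(S2_Y, B ** 2 * S2_X + S2_E)     # the two components of S_Y^2
print(S2_E, S2_Y * (1 - rho ** 2))    # S_E^2 = S_Y^2 (1 - rho^2)
```

Each printed pair should agree up to rounding error.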

Suppose now that we draw an s.r. sample $(y_1, x_1), \ldots, (y_n, x_n)$ and, knowing
$\bar{X}$, want to estimate $\bar{Y}$. It seems sensible that we should take account of any
linear relationship in the population, by using the estimator

$$\bar{y}_L = \bar{y} + B(\bar{X} - \bar{x}). \qquad (4.20)$$

The estimator, $\bar{y}_L$, is called the linear regression estimator of $\bar{Y}$. Similarly $N\bar{y}_L$
is the linear regression estimator of the population total, $Y_T$.

Let us consider the justification for this estimator. In the extreme case (4.18)
it is incontrovertible, for $\bar{y}_L = \bar{Y}$. But no estimation problem exists in this case!
For the more general model (4.19), $\bar{y}_L$ again seems a plausible choice. If $B > 0$,
then if $\bar{x} < \bar{X}$ we would expect that $\bar{y} < \bar{Y}$ and (4.20) applies a compensation
in precisely the manner we would wish, to take account of the linear dependence
of Y on X. See Figure 4.3.

The same is true if $\bar{x} > \bar{X}$, or if $B < 0$.


Fig. 4.3. Compensation effect of linear regression estimator ($B > 0$; $\bar{x} < \bar{X}$)


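Additional justification arises from considering the sampling behaviour of
$\bar{y}_L$. It is clearly consistent in the finite population sense, since when $n = N$,
$\bar{y}_L = \bar{Y}$. We see also that

$$E(\bar{y}_L) = E(\bar{y}) + B[\bar{X} - E(\bar{x})] = \bar{Y},$$

so that $\bar{y}_L$ is unbiased.

Furthermore,

$$\mathrm{Var}(\bar{y}_L) = E[(\bar{y}_L - \bar{Y})^2] = E\{[(\bar{y} - \bar{Y}) - B(\bar{x} - \bar{X})]^2\}$$

$$= \frac{1-f}{n}(S_Y^2 - 2BS_{YX} + B^2S_X^2)$$

$$= \frac{1-f}{n}S_Y^2(1 - \rho_{YX}^2). \qquad (4.21)$$

Thus $\mathrm{Var}(\bar{y}_L) \le \mathrm{Var}(\bar{y})$, and the efficiency of $\bar{y}_L$ relative to $\bar{y}$ increases with
$|\rho_{YX}|$, that is, with increase in the correlation between the Y- and X-values.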

Thus $\bar{y}_L$ has obvious advantages. It is unbiased for all sizes of sample and
it cannot be less efficient than $\bar{y}$. Also we can obtain an unbiased estimate of
Var($\bar{y}_L$) from the sample, in the form

$$\frac{1-f}{n}(s_Y^2 - 2Bs_{YX} + B^2s_X^2),$$

where $s_Y^2$, $s_{YX}$, and $s_X^2$ are the familiar unbiased estimators of $S_Y^2$, $S_{YX}$, and
$S_X^2$;

$$\text{e.g. } s_{YX} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x}).$$


But for various reasons the results above are not as conclusive as they might
appear. In practice, the exact value of the parameter B will of course be
unknown. Furthermore, the model (4.19), with the additional assumption of
zero correlation between the $E_i$ and $X_i$, is unlikely to hold precisely; we cannot
assess its appropriateness without studying the total population of (Y, X)
values, in which case we would have no need for sampling or estimation. So
how are we to relate these results to the practical situation?

We merely use our study of the behaviour of $\bar{y}_L$ in the special case above
to motivate the consideration of a family of general linear regression estimators
of the form

$$\bar{y}_L = \bar{y} + b(\bar{X} - \bar{x}) \qquad (4.22)$$

(for different possible values of b) as a general principle of estimation. With
no specific concern for the nature of any relationship between the $Y_i$ and $X_i$,
we might still ask whether an estimator of the form of (4.22) has any desirable
properties as an estimator of $\bar{Y}$. The fact that it clearly does have such properties
when (Y, X) satisfy (4.19) with $S_{XE} = 0$ and $b = B$ justifies such an enquiry.

We must consider two possibilities: either that b is given some pre-assigned
value, or that we seek to estimate an appropriate value from the sample.


(a) Preassigned b

The sampling properties of $\bar{y}_L$ defined by (4.22) are already known to us from
study of the special form (4.20). We see that, whatever the value of b, $\bar{y}_L$ is
unbiased, since

$$E(\bar{y}_L) = E(\bar{y}) + b[\bar{X} - E(\bar{x})] = \bar{Y}.$$

Also,

$$\mathrm{Var}(\bar{y}_L) = \frac{1-f}{n}(S_Y^2 - 2bS_{YX} + b^2S_X^2) \qquad (4.23)$$

with corresponding unbiased sample estimate

$$\frac{1-f}{n}(s_Y^2 - 2bs_{YX} + b^2s_X^2).$$


An obvious question arises: if $\bar{y}_L$ is unbiased for all values of b, for what
value of b does it have minimum variance? From (4.23) this must occur when

$$bS_X^2 - S_{YX} = 0$$

or

$$b = b_0 = S_{YX}/S_X^2 = \rho_{YX}\frac{S_Y}{S_X},$$

in which case Var($\bar{y}_L$) takes the minimum value

$$\mathrm{Min\ Var}(\bar{y}_L) = \frac{1-f}{n}S_Y^2(1 - \rho_{YX}^2). \qquad (4.24)$$

But this is just the same as (4.21), so we conclude that

$$\bar{y} + \rho_{YX}\frac{S_Y}{S_X}(\bar{X} - \bar{x})$$

is the most efficient estimator of $\bar{Y}$ of the form (4.22), irrespective of any possible
relationship between Y and X in the population. If the model (4.19) happens to
hold, then the optimum estimator is precisely the one, (4.20), which we
considered for that model (and which had the practical appeal of applying
an appropriate form of compensation to $\bar{y}$).
Again, $b_0$ will not be known in practice so that the optimal estimator is
inaccessible. However, we may be prepared to assign some specific value to
b irrespective of the sample. This could happen if other studies of a similar
nature have been carried out, and we feel fairly confident in transferring earlier
knowledge to the present situation. Or again, the measures Y and X may be
such that we anticipate a particular value for the slope of a linear relationship
between them. In such situations we can use the sample approximation for
(4.23) to assess the precision of our estimator, or compare it in efficiency with
the s.r. sample mean, $\bar{y}$. We can also construct corresponding approximate
confidence intervals for $\bar{Y}$ in the usual way, on the assumption of the normality
of $\bar{y}$ and $\bar{x}$. (Appropriate conditions for this follow from the discussion in
Section 2.5.)

Furthermore, by considering the sample analogue of (4.24) we can informally
assess how well our regression estimator compares in efficiency with the best
possible estimator of the form (4.22). The relative efficiency is

$$(1 - \rho_{YX}^2)\left\{1 - 2b\rho_{YX}\left(\frac{S_X}{S_Y}\right) + b^2\left(\frac{S_X}{S_Y}\right)^2\right\}^{-1},$$

which, for large samples, may be reasonably estimated by substituting sample
estimates of $S_Y$, $S_X$, and $\rho_{YX}$. We see that the proportional increase in variance
due to using a non-optimal value for b is

$$\frac{\mathrm{Var}(\bar{y}_L)}{\mathrm{Min\ Var}(\bar{y}_L)} - 1 = \frac{\rho_{YX}^2}{1 - \rho_{YX}^2}\left(1 - \frac{b}{b_0}\right)^2 \qquad (4.25)$$

where $b_0$ is the optimal value, $\rho_{YX}S_Y/S_X$. Again this is readily estimated from
the sample.


(4.25) has important implications. Non-optimality of choice of the value of
b can produce serious inefficiency in the regression estimator, relative to
optimal choice. The relative inefficiency will be greatest in populations where
Y and X are highly correlated. If the correlation is modest, choice of the
value of b is less crucial, but then the potential gain over using $\bar{y}$ is very much
less.


Example 4.3

Consider the expression (4.25) when $\rho_{YX}$ = 0.6, 0.8, 0.9, 0.95 and
$|1 - b/b_0|$ = 0.5, 0.2, 0.1. The proportional increases in variance in
these situations are:

                            $\rho_{YX}$
$|1 - b/b_0|$      0.6      0.8      0.9      0.95

    0.5           0.141    0.444    1.066    2.314
    0.2           0.023    0.071    0.171    0.370
    0.1           0.006    0.018    0.043    0.093

Thus, for example, when $\rho_{YX}$ = 0.95 or 0.9 even modest discrepancies
seriously affect the efficiency of the estimator.
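The entries of the table follow directly from (4.25). A minimal sketch of the calculation is given below; the function name `variance_inflation` is introduced here purely for illustration.

```python
def variance_inflation(rho, rel_error):
    """Proportional increase in variance, expression (4.25).

    rho       -- population correlation coefficient between Y and X
    rel_error -- |1 - b/b_0|, relative discrepancy of b from the optimum
    """
    return rho ** 2 / (1 - rho ** 2) * rel_error ** 2


correlations = (0.6, 0.8, 0.9, 0.95)
for rel_error in (0.5, 0.2, 0.1):
    row = [variance_inflation(rho, rel_error) for rho in correlations]
    print(rel_error, "  ".join(f"{value:.3f}" for value in row))
```

Each printed row should reproduce the corresponding row of the table to three decimal places.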
(b) Estimated b

If we have no basis for preassigning a value to b, as will most often be the case,
we must consider how we might use the sample data to suggest an appropriate
value. The optimal value of b, $\rho_{YX}S_Y/S_X$, expressed in terms of population
characteristics, suggests that we might try using the corresponding sample
expression

$$\tilde{b} = \frac{s_{YX}}{s_X^2} = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}. \qquad (4.26)$$

(This has the same form as the least squares estimator of the classical linear
regression coefficient for infinite populations.)

Indeed this is a reasonable procedure, at least in large samples. The
regression estimator of the population mean now has the form

$$\bar{y}_L = \bar{y} + \tilde{b}(\bar{X} - \bar{x}). \qquad (4.27)$$


Its distributional properties are difficult to determine precisely in view of the 
presence of the additional random variable b, which itself is a ratio of two 
statistics. 

Large sample behaviour of (4.27) is more easily studied, and we base our 
investigation on the model (4.19). The extent to which a linear relationship is 
present will be reflected in the value of pyx (as discussed above). 

On the basis of this model we can show that the asymptotic forms of the
expectation and variance of (4.27) are

$$E(\bar{y}_L) = \bar{Y} + O(n^{-1})$$

and

$$\mathrm{Var}(\bar{y}_L) = \frac{1-f}{n}S_Y^2(1 - \rho_{YX}^2) + O(n^{-3/2}). \qquad (4.28)$$


These results are comforting. Firstly we obtain an estimator which is unbiased
in large samples. Secondly, and perhaps more importantly, we see that having
to estimate b from the data is no disadvantage in large samples. We will obtain
the optimum estimator: one with asymptotically as small a variance as is
possible for this type of estimator. Thus, using $\tilde{b}$ must be preferable to assigning
some specific value to b, since at best this will yield an estimator with variance
given by the leading term in (4.28), whilst if we are unfortunate in our choice
of a value for b we may be faced with a far less efficient estimator.

But if all is well in large samples, the same cannot be claimed for small
samples. The distinction between 'large' and 'small' samples in this respect
requires a more detailed knowledge of how the mean and variance of
$\bar{y} + \tilde{b}(\bar{X} - \bar{x})$ depend on sample size and population characteristics. We shall
not attempt to derive such results; indeed there is much that is not fully known
about the sampling distribution of the regression estimator. We shall merely
quote some of the known results. Further details on their derivation and
implications can be found in the standard texts and research literature. Again,
Cochran (1977, Chapter 7) is a useful source of information and references.

On the question of bias, it happens that this becomes serious if there is 
marked evidence of a quadratic relationship between Y and X. Alternatively, 
it is aggravated by excess kurtosis in the set of X-values in the population. 
Correspondingly, the variance is seen to be most affected by the coefficient of 
skewness of the X-values, so that the large-sample approximation will be least 
accurate for highly skew populations (with respect to the X variable). A 
reasonable practical prescription for adopting the large-sample approximation 
for the variance of the regression estimator is that the sample size should be 
in excess of 50, and the set of X-values in the population not greatly skew. 

Again, in use we will need to estimate the variance

$$\frac{1-f}{n}S_Y^2(1 - \rho_{YX}^2),$$

and we will use

$$s^2(\bar{y}_L) = \frac{1-f}{n}(s_Y^2 - \tilde{b}s_{YX}).$$
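The calculations in (4.26), (4.27) and the variance estimate above can be sketched as follows. This is an illustrative implementation only: the function name `regression_estimate` and its argument names are assumptions, and the data in the usage lines are hypothetical.

```python
from statistics import mean


def regression_estimate(y, x, X_bar, N):
    """Regression estimate of the population mean with b estimated from the sample.

    y, x  -- sample values of the variable of interest and the auxiliary variable
    X_bar -- known population mean of the auxiliary variable
    N     -- population size
    Returns (estimate, estimated slope, estimated variance of the estimate).
    """
    n = len(y)
    y_bar, x_bar = mean(y), mean(x)
    s_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x)) / (n - 1)
    s2_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    s2_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    b = s_yx / s2_x                                  # (4.26)
    estimate = y_bar + b * (X_bar - x_bar)           # (4.27)
    var_est = (1 - n / N) / n * (s2_y - b * s_yx)    # large-sample variance estimate
    return estimate, b, var_est


# Hypothetical data lying exactly on the line y = 2x + 1.
est, slope, var_est = regression_estimate([3, 5, 7, 9], [1, 2, 3, 4], X_bar=3.0, N=100)
print(est, slope, var_est)
```

With exactly linear data the estimated variance is zero up to rounding error, which mirrors the extreme case (4.18). The variance estimate relies on the large-sample approximation, so the earlier caveats about sample size and skewness of the X-values still apply.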


Example 4.4 


A civil defence survey is to be conducted on the attitudes and 
preparedness of householders in a particular town, to a possible 
civil disaster. A list is available of the 11 200 householders in the 
town, and the survey is to be conducted by personal interviews 
which for a randomly chosen householder will cost £4.50. A total 
of £6 000 is available to carry out the survey and setting-up costs 
are £500. Major interest centres on estimating the population mean
$\bar{Y}$ of some variable Y. It is possible to observe, at an additional
cost of £0.50 per individual, the value of a concomitant variable
X for which it is known that $\bar{X} = 31$.

A pilot study, and previous experience, combine to suggest
approximate values for certain population characteristics as
follows:

$$C_Y = 0.4, \quad C_X = 0.2, \quad \rho_{YX} = 0.65.$$

Should we use a ratio estimator, or s.r. sample mean, if processing
costs in the two cases would be £200 and £150, respectively?

Firstly, consider estimating $\bar{Y}$ by $\bar{y}$ (the s.r. sample mean). We
have, for sample size n,

$$\mathrm{S.E.}(\bar{y}) = [(1 - n/11200)/n]^{1/2}S_Y$$

and

$$\mathrm{S.E.}(\bar{y})/\bar{Y} = [(1 - n/11200)/n]^{1/2}C_Y.$$

But using the cost information

$$650 + 4.50n = 6000, \text{ so that } n = 1188.$$

Thus for the s.r. sample mean we achieve an estimator ($\bar{y}$) for which

$$\mathrm{S.E.}(\bar{y})/\bar{Y} = 0.011.$$

If we use the ratio estimator $\bar{y}_R = (\bar{y}/\bar{x})\bar{X}$ we have, for sample size $n'$,

$$\mathrm{S.E.}(\bar{y}_R) = [(1 - n'/11200)/n']^{1/2}(S_Y^2 - 2R\rho_{YX}S_YS_X + R^2S_X^2)^{1/2},$$

where $R = \bar{Y}/\bar{X}$. Hence

$$\mathrm{S.E.}(\bar{y}_R)/\bar{Y} = [(1 - n'/11200)/n']^{1/2}(C_Y^2 - 2\rho_{YX}C_YC_X + C_X^2)^{1/2}.$$

From the cost information,

$$700 + 5n' = 6000, \text{ so that } n' = 1060,$$

and for the ratio estimator we find

$$\mathrm{S.E.}(\bar{y}_R)/\bar{Y} = 0.0091.$$

Thus the ratio estimator is the better choice and we need a random
sample of size 1060.
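The arithmetic of Example 4.4 can be reproduced with a short script. The sketch below uses only the figures quoted in the example; the helper names `affordable_n` and `fpc_factor` are introduced here for illustration.

```python
from math import sqrt

N = 11200                          # householders on the list
BUDGET = 6000.0                    # total funds available (pounds)
C_Y, C_X, RHO = 0.4, 0.2, 0.65     # approximate population characteristics


def affordable_n(fixed_cost, unit_cost):
    """Largest whole sample size that the budget allows."""
    return int((BUDGET - fixed_cost) // unit_cost)


def fpc_factor(n):
    """[(1 - n/N)/n]^(1/2), the factor common to both standard errors."""
    return sqrt((1 - n / N) / n)


# s.r. sample mean: setting-up 500 + processing 150, 4.50 per interview.
n_mean = affordable_n(500 + 150, 4.50)
cv_mean = fpc_factor(n_mean) * C_Y

# Ratio estimator: setting-up 500 + processing 200, 4.50 + 0.50 per interview.
n_ratio = affordable_n(500 + 200, 5.00)
cv_ratio = fpc_factor(n_ratio) * sqrt(C_Y ** 2 - 2 * RHO * C_Y * C_X + C_X ** 2)

print(n_mean, round(cv_mean, 3))
print(n_ratio, round(cv_ratio, 4))
```

The sample sizes are rounded down so that the budget is not exceeded, which matches the values 1188 and 1060 used in the example.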


4.4 Comparison of ratio and regression estimators 


We have considered the circumstances under which the ratio estimator, $\bar{y}_R$, of
$\bar{Y}$ is more efficient than the s.r. sample mean, $\bar{y}$. This arises if

$$\rho_{YX} > \frac{RS_X}{2S_Y} = \frac{C_X}{2C_Y} \quad (\text{where } R = \bar{Y}/\bar{X}). \qquad (4.29)$$

We saw that, for large samples,

$$\mathrm{Var}(\bar{y}_R) \approx \frac{1-f}{n}(S_Y^2 - 2R\rho_{YX}S_YS_X + R^2S_X^2). \qquad (4.30)$$

Correspondingly, we observed that the regression estimator $\bar{y}_L$ (with b
estimated from the data) has

$$\mathrm{Var}(\bar{y}_L) \approx \frac{1-f}{n}S_Y^2(1 - \rho_{YX}^2), \qquad (4.31)$$

and it cannot be less efficient than $\bar{y}$.
There remains the comparison of $\bar{y}_R$ and $\bar{y}_L$. We have

$$\mathrm{Var}(\bar{y}_R) - \mathrm{Var}(\bar{y}_L) = \frac{1-f}{n}(R^2S_X^2 - 2R\rho_{YX}S_YS_X + \rho_{YX}^2S_Y^2)$$

$$= \frac{1-f}{n}(RS_X - \rho_{YX}S_Y)^2, \qquad (4.32)$$

and we therefore conclude that the regression estimator must be at least as
efficient as the ratio estimator under all circumstances. From (4.32) we see that
the only situation in which the ratio estimator can have the same efficiency as
the regression estimator is when

$$R = \rho_{YX}\frac{S_Y}{S_X}. \qquad (4.33)$$

But we have seen that $\rho_{YX}S_Y/S_X$ is the optimum choice $b_0$ for the parameter
b, in the sense that it minimises the variance of $\bar{y} + b(\bar{X} - \bar{x})$. We saw also that
it is the inevitable value for the parameter B in the model (4.19). Thus $\bar{y}_R$ and
$\bar{y}_L$ are equally efficient only if

$$R = b_0 = B.$$
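The three large-sample variances can be compared numerically. The following sketch evaluates the variance of the s.r. sample mean together with (4.30) and (4.31) for given population characteristics; the function name and the numerical values are hypothetical and serve only to illustrate (4.32) and (4.33).

```python
def large_sample_variances(S_Y, S_X, rho, R, n, N):
    """Return (Var mean, Var ratio, Var regression) using (4.30) and (4.31)."""
    k = (1 - n / N) / n
    var_mean = k * S_Y ** 2
    var_ratio = k * (S_Y ** 2 - 2 * R * rho * S_Y * S_X + R ** 2 * S_X ** 2)
    var_regression = k * S_Y ** 2 * (1 - rho ** 2)
    return var_mean, var_ratio, var_regression


# Hypothetical population characteristics.
S_Y, S_X, rho, n, N = 4.0, 2.0, 0.65, 50, 1000
k = (1 - n / N) / n

v_mean, v_ratio, v_reg = large_sample_variances(S_Y, S_X, rho, R=1.5, n=n, N=N)
print(v_mean, v_ratio, v_reg)
print(v_ratio - v_reg, k * (1.5 * S_X - rho * S_Y) ** 2)   # both sides of (4.32)

# When R = rho * S_Y / S_X, condition (4.33), the two variances coincide.
_, v_ratio_eq, v_reg_eq = large_sample_variances(S_Y, S_X, rho, R=rho * S_Y / S_X, n=n, N=N)
print(v_ratio_eq, v_reg_eq)
```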


Let us look a little more closely at the comparison of $\bar{y}$, $\bar{y}_R$, and $\bar{y}_L$, and at
the role of any formal model expressing an element of linearity (or
proportionality) in the relationship between the Y and X values in the population.

Note immediately that we do not need to make any explicit assumptions about
a possible linear relationship between Y and X to derive the properties of $\bar{y}$, $\bar{y}_R$,
and $\bar{y}_L$ described above. Merely defining $\bar{y}_R$ by (4.12) and $\bar{y}_L$ by (4.27), we find
that (asymptotically, hence approximately in large samples) they are unbiased
and have variances given by (4.30) and (4.31), respectively.

Thus $\bar{y}_L$ is always more efficient than $\bar{y}$, except in the isolated case where
$\rho_{YX} = 0$, when they are equally efficient. Also, the relative efficiencies of $\bar{y}_R$
and $\bar{y}$ are governed by the value of $\rho_{YX}$, with $\bar{y}_R$ being more efficient if (4.29)
is satisfied (with the prerequisite that $C_X < 2C_Y$), otherwise less efficient.

Finally, $\bar{y}_L$ will always be more efficient than $\bar{y}_R$, unless the special
relationship (4.33) happens to hold in the population, in which case they are equally
efficient. If (4.33) holds, then so of course does (4.29) and the conditions are
satisfied for $\bar{y}_R$ to be more efficient than $\bar{y}$. (Otherwise we could encounter a
contradiction where $\bar{y}_L$ and $\bar{y}_R$ had the same variance, but $\bar{y}_L$ was inevitably
more efficient than $\bar{y}$, whilst $\bar{y}_R$ happened to be less efficient than $\bar{y}$!)

Thus, what are important in this comparison are the relative values of $\bar{Y}$
and $\bar{X}$, and of $S_Y^2$ and $S_X^2$, and the correlation coefficient, $\rho_{YX}$. We have no
need to formulate any linear model to express the relationship between Y and
X. However, we could choose to set up such a model, and this would serve
two purposes. Firstly, it would provide a practical motivation for initially
considering types of estimator of the form of $\bar{y}_R$ or $\bar{y}_L$, as we have already
seen. Secondly, since $\rho_{YX}$ is a measure of linear association, such a model
might help to illustrate more tangibly the comparison of $\bar{y}_R$ and $\bar{y}_L$.

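With no implied constraints on the bivariate finite population of values
$(Y_i, X_i)$ we can freely declare that

$$Y_i = \bar{Y} + k(X_i - \bar{X}) + E_i \qquad (4.34)$$

or that

$$Y_i = k'X_i + E_i' \qquad (4.35)$$

with

$$\sum_{i=1}^{N} E_i = \sum_{i=1}^{N} E_i' = 0.$$

It must of course be true that $k' = R$.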

But with no additional assumptions, such models are sterile; in their
unconstrained forms, (4.34) and (4.35), they imply no linearity of relationship, or
proportionality between Y and X. However complex the pattern of values
$(Y_i, X_i)$, this can be accommodated by suitable values $E_i$ (or $E_i'$) depending
on the $X_i$. But if we demand that such dependence is not to be entertained,
say by postulating that

$$\sum_{i=1}^{N} E_i(X_i - \bar{X}) = \sum_{i=1}^{N} E_i'(X_i - \bar{X}) = 0,$$

the models become much more structured. They now represent linearity, or
proportionality, with superimposed 'deviations' (or 'errors') uncorrelated with
the X values.

As we have seen at the beginning of the previous section, we must now
have, in (4.34),

$$k = S_{YX}/S_X^2 = B.$$

Also in (4.35) we find, on multiplying each side by $(X_i - \bar{X})$ and summing
over the whole population,

$$k' = S_{YX}/S_X^2 = B.$$

Thus if (4.35), with

$$\sum_{i=1}^{N} E_i'(X_i - \bar{X}) = 0,$$

happens to be an appropriate model for our population, it can be re-expressed
as (4.34) with $E_i = E_i'$, and we have

$$R = k' = k = B.$$

But this is precisely the condition that needs to hold for $\bar{y}_R$ and $\bar{y}_L$ to be equally
efficient.

Hence from the practical viewpoint the tangible justification for using $\bar{y}_R$
will rest on any indication in the data of a linear relationship through the origin
between the values of Y and X, with no suggestion of correlation between
the $X_i$ and the deviations, $E_i$. So such a relationship does turn out to have a
formal basis. The likely inferiority of $\bar{y}_R$ relative to $\bar{y}_L$ will be indicated by
observing that the Y and X values do not seem to be roughly proportional (in
a positive sense) to one another, or do not appear even to be roughly linearly
related.

In conclusion, we can informally summarise the conditions which will 
support the use of ratio or regression estimators in the following way. 

(i) We are concerned with estimating $\bar{Y}$ (or $Y_T$) in a finite population where
for each $Y_i$ value there is a value $X_i$ of some auxiliary variable.

(ii) Y and X can be simultaneously sampled and the population mean $\bar{X}$
is known precisely.

(iii)(a) If the data (or general knowledge) suggest some reasonable degree
of linear relationship between the Y and X values then we can expect to
obtain a useful gain in efficiency over $\bar{y}$ (or $y_T$) by using the regression estimator
$\bar{y}_L$ (or $y_{TL} = N\bar{y}_L$).

(iii)(b) If the linear relationship has positive slope and appears to pass
through the origin, we can expect a similar gain, for slightly less computational
effort, by using the ratio estimator $\bar{y}_R$ (or $y_{TR}$). This saving in computation is
the only possible advantage of $\bar{y}_R$ over $\bar{y}_L$, and is only worth exploiting if
there is a clear indication of proportionality, i.e. of a positive linear relationship
through the origin.

Note that in our formal discussion and comparison of these estimators the 
sole concern has been for achieving (asymptotically) unbiasedness and 
minimum variance. Whilst we shall not consider here any other criteria of 
choice, we should not disregard alternative prospects. Questions of the cost 
involved in achieving a certain precision are most relevant and may lead to a 
principle of balancing cost against precision. We touched on this in Example 
4.4. In our study (in the next chapter) of estimators for stratified populations, 
we shall have cause to consider cost optimality as a criterion of choice in 
addition to the idea of minimising the variance of unbiased estimators. 


4.5 Ratio estimates and pps sampling


In Section 4.1 we contemplated the idea of estimating a ratio using pps
sampling. The notion readily extends to a ratio estimator of a population mean
(or total).


Suppose we take a sample $(y_1, x_1), \ldots, (y_n, x_n)$ without replacement in such
a way that the probability of obtaining the observation $y_i$ is proportional to
$x_i$: say $p(y_i) = kx_i = x_i/\sum_{j=1}^{N} X_j$.

Further suppose that we know $\bar{X}$ (and hence the total $X_T = N\bar{X}$) and that (for
illustrative purposes at this stage) $Y_i$ is proportional to $X_i$, i.e. $Y_i = BX_i$.

We see that sampling with probability proportional to the size of the ancillary
variable X is just the same as pps sampling for Y, since

$$p_i = p(y_i) = y_i\Big/\sum_{j=1}^{N} Y_j = x_i\Big/\sum_{j=1}^{N} X_j.$$

Now consider the estimator

$$\tilde{Y}_T = \frac{1}{n}\sum_{1}^{n} y_i/p_i = \frac{X_T}{n}\sum_{1}^{n}(y_i/x_i). \qquad (4.36)$$

We have seen in Section 2.9 that $E(\tilde{Y}_T) = Y_T$ with variance zero in this
pathological case. Thus we have ideal unbiased estimates of $Y_T$ (or $\bar{Y}$) as $\tilde{Y}_T$
(or $\tilde{Y}_T/N$).

But (4.36) is just based on the sample average ratio $r_1$ (see Section 4.1) but
with pps rather than s.r. sampling.

Although we will not in practice encounter strict proportionality, we can 
often come close to this condition and (4.36) remains a useful estimator. We 
then have the dual advantage in using (4.36) of obtaining a highly efficient 
estimator and of using a pps sampling scheme for the ancillary variable X 
which is likely to be easy to implement (whereas this might not be so for the 
principal variable, Y). 

Examples come readily to mind:

Y                          X

Yield of wheat             Acreage planted with wheat (sampled cartographically)
Productivity of a          Number of employees in different outlets
multi-outlet company       (sampled from an overall list of employees)
Household income           Size of household (sampled from Electoral Roll)


How are we to conduct pps sampling? There are various possibilities. If we
have a list

$$X_1, X_2, \ldots, X_N:$$

(i) We can form the partial sums

$$X_1,\; X_1 + X_2,\; \ldots,\; \sum_{1}^{N} X_i\ (= X_T).$$

We then pick a random number Z in $(1, X_T)$ and choose $X_i$ where

$$\sum_{1}^{i-1} X_j < Z \le \sum_{1}^{i} X_j.$$

But this is time consuming, and if we have such a list it can be easier to use
the following approach.

(ii) Lahiri's Method. Suppose we know or can obtain an idea of the value
of the largest $X_j$ ($j = 1, 2, \ldots, N$). Call this $X_{max}$. We now pick at random
and independently two numbers: one an integer in (1, 2, ..., N) and the second
in the range $(0, X_{max})$. Suppose we obtain j and x. If $x < X_j$ we take $X_j$ as our
observation; otherwise we reject the pair of numbers and try again. However,
this method can be rather sensitive to errors in our assessment of the value of
$X_{max}$.

Procedures based on physical structure are also often available, such as
sampling proportional to geographic area merely by choosing a location at
random on a map.
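The two list-based procedures can be sketched in code. The following is a minimal illustration rather than a definitive implementation: the function names are introduced here, the size measures are assumed to be non-negative integers for the cumulative method, and each call selects a single unit (returned as a zero-based list position).

```python
import random
from bisect import bisect_left
from itertools import accumulate


def pps_cumulative(sizes, rng):
    """Method (i): select one unit using the partial sums of the sizes."""
    partial_sums = list(accumulate(sizes))
    z = rng.randint(1, partial_sums[-1])     # random integer Z in 1..X_T
    # First position whose partial sum is at least Z.
    return bisect_left(partial_sums, z)


def pps_lahiri(sizes, x_max, rng):
    """Method (ii), Lahiri's method: propose a unit, accept it if x < X_j."""
    while True:
        j = rng.randrange(len(sizes))        # unit chosen uniformly from the list
        x = rng.uniform(0, x_max)            # second number in (0, X_max)
        if x < sizes[j]:
            return j


rng = random.Random(1)
sizes = [1, 0, 3]                            # hypothetical size measures
cumulative_draws = [pps_cumulative(sizes, rng) for _ in range(200)]
lahiri_draws = [pps_lahiri(sizes, max(sizes), rng) for _ in range(200)]
print(sorted(set(cumulative_draws)), sorted(set(lahiri_draws)))
```

In this toy list the unit of size zero can never be selected by either method. If `x_max` understates the true largest size, the acceptance step no longer gives selection probabilities proportional to size, which is the sensitivity noted above.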


4.6 Exercises 


4.1. Part of a coniferous forest contains 280 trees of the same species and 
of similar ages. A preliminary estimate is required of the total weight of timber 
that these trees will yield. A forestry expert claims to be able to make fairly 
accurate assessments of the yield from any tree merely by visual inspection, 
and makes such assessments for all 280 trees. He assesses the total yield as 
439.5 tonnes. Subsequently, 25 trees picked at random are felled and their 
timber yields accurately determined. The actual yields, y;, and corresponding 
assessed yields, x;, provide the following summary results. 


$$\sum_{1}^{25} y_i = 39.8, \quad \sum_{1}^{25} x_i = 41.4,$$

$$\sum_{1}^{25} y_i^2 = 69.08, \quad \sum_{1}^{25} y_ix_i = 70.64, \quad \sum_{1}^{25} x_i^2 = 73.47.$$

Estimate the total yield using either the ratio or regression estimator, 
whichever seems most appropriate. Compare the efficiencies of the ratio 
estimator, the regression estimator, and the estimator based on the sample of 
y; values alone. 


4.2. A field of wheat is divided into a large number of sampling units and 
the weights of grain (y;) and grain plus straw (x;) are recorded for an s.r. 
sample of size n. Additionally, the total produce (grain plus straw) for the 
whole field is weighed. Suppose that the coefficients of variation have the values 


Cy =1.0 Cxy =0.9 Cy =1.1. 
Calculate the gain in precision from estimating the total grain yield by means
of a ratio estimator (using the x-values) rather than from the sample grain
data alone.
Suppose it takes 15 minutes to weigh the grain on any unit and a further 2
minutes to weigh the straw; also 2 hours to weigh the total produce. How
many units should be chosen for the ratio estimate to be more economical
than the s.r. sample total?
4.3. In studying lung function in a group of 560 workers in a coal mine,
an estimate was required of the mean value of some relevant measure Y. An
s.r. sample of 10 workers was chosen and their Y values, $y_i$, determined by
an appropriate test. A note was also made of their heights, $x_i$. The results were:


yi 3.0 35 3.3 x Se 4.1 3.2 a7. 2.9 3.9 3.4 
x, (cm) 173 183 170 175 160 157 168 180 178 163 


From routine medical records the average height for the group of 560 workers
is known to be $\bar{X} = 173.2$ cm. Estimate $\bar{Y}$ from the data, and calculate an
approximate standard error for your estimator.


5 Stratified populations and stratified simple random sampling


In our earlier enquiries about the precision with which we can estimate a
population characteristic, $\bar{Y}$ say, from the analogous quantity $\bar{y}$ in an s.r.
sample, we saw that a crucial factor was the variance $S_Y^2$ of the population.
The larger the dispersion in the population, reflected in the value of $S_Y^2$, the
less precise is the estimator $\bar{y}$ in the sense that its sampling variance is larger.
This is only to be expected; it makes sense intuitively, it is a feature of classical
statistical methods for infinite populations, and it is reflected in the value of
the sampling variance, $(1 - f)S_Y^2/n$, of $\bar{y}$ in finite populations.

Consider a simple numerical example. Suppose we have a finite population 
of 20 members in which Y takes values 


ne 4 5 ae 3. 2 Ge 3S C2 eee 


Its mean is $\bar{Y} = 4$; its variance, $S_Y^2 = 40/19$. If we take an s.r. sample of size
5 and use the s.r. sample mean $\bar{y}$ to estimate $\bar{Y}$ we have

$$\mathrm{Var}(\bar{y}) = 6/19 = 0.316.$$


Clearly we could obtain quite a range of different values for $\bar{y}$ in different
samples; from 2.2 to 5.8. But notice the structure of the population. It could
be rearranged as

2 2 2 2   3 3 3 3   4 4 4 4   5 5 5 5   6 6 6 6

It consists of 5 groups, in each of which all 4 Y-values are the same. Suppose,
for this rather special population, we had some mechanism by which we could
choose just one member at random from each group to constitute our sample
of size 5. We must inevitably obtain on all occasions

2, 3, 4, 5, 6,

with sample mean 4. Thus our estimate has no sampling fluctuation, its sampling
variance is zero, and the estimate is always equal to the population mean $\bar{Y}$
it is estimating. This extremely favourable situation has arisen because we
have been able to remove all variability from within the defined groups into
which we have divided our population and from each of which we are taking
a single observation at random to make up our required sample of size 5.
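The figures in this numerical example are easily verified. The sketch below uses exact rational arithmetic and enumerates every sample that takes one member from each group; the variable names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# The population of 20 values: five groups of four identical values.
population = [value for value in (2, 3, 4, 5, 6) for _ in range(4)]
N, n = len(population), 5

Y_bar = Fraction(sum(population), N)
S2 = sum((Fraction(v) - Y_bar) ** 2 for v in population) / (N - 1)
var_srs_mean = (1 - Fraction(n, N)) * S2 / n     # (1 - f) S^2 / n

# All 4^5 samples formed by choosing one member from each group.
groups = [[value] * 4 for value in (2, 3, 4, 5, 6)]
grouped_sample_means = {Fraction(sum(choice), n) for choice in product(*groups)}

print(Y_bar, S2, var_srs_mean, float(var_srs_mean))
print(grouped_sample_means)
```

The set of grouped sample means contains the single value 4, confirming that the sampling variance under this scheme is zero.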

Can this artificial situation be encountered, at least approximately, in
practice? If so, we should be able to reduce the sampling variance of an estimator
below that encountered from s.r. sampling from the whole population.

Let us consider a more practical example. We want to conduct a survey to
estimate the mean height of the schoolchildren in a small primary school with
four classes, each of about 30 children, covering four different age groups. We
decide to measure the heights of a sample of 20 children for this purpose. We
might do this by picking 20 children at random from the playground during
their brief mid-morning break. If all 120 children were present in the
playground, we would obtain an s.r. sample of size 20. But there would not be
much time for this, and it is not obvious how we would choose a random sample.

For sheer convenience it might be better to visit the four classes at lesson 
time and measure the heights of an s.r. sample of 5 children from each class. 
A stimulus for this approach is convenience or ease of sampling, but note the 
effect. The classes reflect natural groupings of the population and, because of 
the relationship between stature and age, the heights in each such group are 
likely to be less variable than in the population at large. This effect will be by 
no means as extreme as in the numerical example above, but we might still 
hope that the relative homogeneity of the groups will lead to some improvement 
in the efficiency of our estimate of the mean height, compared with an s.r. 
sample drawn from the total population. 

If the population falls into natural groups, or can be so divided, it is called 
a stratified population. The examples suggest two possible advantages of such 
a structure: 

(i) the stratification might be an aid to efficient estimation under appropriate
conditions in the sense illustrated above,

(ii) the stratification may be particularly convenient in administrative terms 
making it easier to draw our sample. 

Thus, it would appear that we may be able to estimate some population 
characteristic more efficiently by sampling from each group (stratum) separ- 
ately than by sampling from the population at large, if it so happens that our 
variable Y shows less variation within each stratum than in the total population. 
This advantage cannot be guaranteed of course. Stratification chosen purely 
for administrative convenience (ease of access or of sampling) will not 
necessarily yield the required relative within-stratum homogeneity, but it will 
often do so. 

Clearly it is worth considering the situation in more detail. In this chapter 
we will study how to estimate population characteristics in stratified popula- 
tions, under what circumstances we can expect to obtain better estimators than 
those derived from an s.r. sample from the unstratified population, and the 
extent to which practical considerations may influence any potential efficiency 
gains from stratification. 


5.1 Stratified (simple) random sampling 


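Suppose we wish to estimate the mean, $\bar{Y}$, of the set of values $Y_1, Y_2, \ldots, Y_N$
in a finite population. We shall assume that the population is stratified, that
is to say it has been divided into k non-overlapping groups, or strata, of sizes

$$N_1, N_2, \ldots, N_k \quad \left(\sum_{i=1}^{k} N_i = N\right)$$

with members

$$Y_{ij} \quad (i = 1, \ldots, k;\ j = 1, \ldots, N_i).$$

Thinking of each stratum as a sub-population we can carry over our earlier
notation, and denote the stratum means and variances by

$$\bar{Y}_1, \bar{Y}_2, \ldots, \bar{Y}_k$$

and

$$S_1^2, S_2^2, \ldots, S_k^2$$

respectively.

The population mean and variance, $\bar{Y}$ and $S^2$, will of course have the forms

$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{k} N_i\bar{Y}_i,$$

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{k}\sum_{j=1}^{N_i}(Y_{ij} - \bar{Y}_i + \bar{Y}_i - \bar{Y})^2$$

$$= \frac{1}{N-1}\left\{\sum_{i=1}^{k}(N_i - 1)S_i^2 + \sum_{i=1}^{k}N_i(\bar{Y}_i - \bar{Y})^2\right\}. \qquad (5.1)$$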

We shall assume that a sample of size n is chosen by taking an s.r. sample
of pre-determined size from each stratum. The stratum sample sizes will be
denoted $n_1, n_2, \ldots, n_k$ ($\sum_{i=1}^{k} n_i = n$). The s.r. sample from the ith stratum has
members

$$y_{i1}, y_{i2}, \ldots, y_{in_i} \quad (i = 1, 2, \ldots, k),$$

and we denote the sample mean and variance in the ith stratum by

$$\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$$

and

$$s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2.$$

In each stratum we have a sampling fraction, $f_i = n_i/N_i$ ($i = 1, 2, \ldots, k$).

Such a sampling procedure for the choice of a sample of total size n from 
the overall population is termed stratified (simple) random sampling. 

At this stage we shall not discuss the basis of the stratification of the
population; we shall just accept that the population is so stratified, and that
we wish to estimate $\bar{Y}$ from the sample values $y_{ij}$ yielded by stratified random
sampling. Later, we shall consider how the stratification might be constrained
by administrative considerations (sampling ease, costs, etc.) or, in contrast,
how we can sometimes make use of any practical information we may possess
about the population to effect a stratification likely to lead to particularly
efficient estimation of $\bar{Y}$.

The estimator of $\bar{Y}$ commonly employed is the so-called stratified sample
mean. This is defined as

$$\bar{y}_{st} = \sum_{i=1}^{k} W_i \bar{y}_i.$$

Note: this assumes that we know the stratum sizes $N_i$ precisely, in order to
determine the stratum weights $W_i = N_i/N$.

The stratified sample mean, $\bar{y}_{st}$, is not in general the same as the
overall sample average

$$\bar{y}' = \frac{1}{n} \sum_{i=1}^{k} n_i \bar{y}_i.$$

The two coincide only when $n_i/n = N_i/N$ for every stratum. This would imply
that the sampling fractions $f_i = n_i/N_i$ are identical for all strata: a
special form of stratified sampling where the stratum sample sizes, $n_i$, are
said to be chosen by proportional allocation (since the sample sizes are
chosen to be proportional to the stratum sizes).

Such a principle can be time-saving with regard to the collection of the
sample data. We shall see that it can have statistical advantages, but again it
presupposes a knowledge of the stratum sizes, $N_i$. We assume throughout that
this knowledge exists; otherwise the weights $W_i$ will need to be estimated in
some manner, with a resulting bias in the estimator $\bar{y}_{st}$ and loss of
accuracy in the later derived results. Further comments on this problem are
made in Section 5.5.

We must firstly consider the mean and variance of $\bar{y}_{st}$. We have

$$E(\bar{y}_{st}) = \sum_{i=1}^{k} W_i E(\bar{y}_i) = \sum_{i=1}^{k} W_i \bar{Y}_i = \bar{Y},$$

so that $\bar{y}_{st}$ is unbiased, in view of the inevitable unbiasedness of
the stratum sample means, $\bar{y}_i$. Note that

$$E(\bar{y}') = \frac{1}{n} \sum_{i=1}^{k} n_i \bar{Y}_i,$$

so that the overall sample average $\bar{y}'$ will be unbiased only in the case
of proportional allocation (where $n_i/n = N_i/N$).

For the variance of $\bar{y}_{st}$ we have

$$\mathrm{Var}(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 (1 - f_i) S_i^2 / n_i. \qquad (5.2)$$

The proof is straightforward:

$$\mathrm{Var}(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 \, \mathrm{Var}(\bar{y}_i)$$

provided (as is implied in the stratified random sampling procedure) that
$\mathrm{Cov}(\bar{y}_i, \bar{y}_j) = 0$ if $i \ne j$, that is, the simple
random sample means for different strata are uncorrelated. Thus we obtain
(5.2), using (2.3) for $\mathrm{Var}(\bar{y}_i)$.

One further aspect of the behaviour of $\bar{y}'$ needs comment. Although with
proportional allocation $\bar{y}_{st}$ and $\bar{y}'$ have the same numerical
value, the overall sample average $\bar{y}'$ does not have the same variance as
$\bar{y}$, the mean of an s.r. sample of size $n$ from the whole population.
The variance of $\bar{y}'$ takes on the appropriate special form of (5.2)
(see (5.4) below), whilst

$$\mathrm{Var}(\bar{y}) = (1 - f) S^2 / n,$$

with $f = n/N$. The reason for this discrepancy is the element of non-randomness
of the stratified random sample that arises from the constraint that specific
numbers, $n_i$, of members of the sample must be chosen from each distinct
sub-population defined by the stratification.


Some special cases of (5.2) should be considered.

(a) Sampling fractions $f_i = n_i/N_i$ negligible,

$$\mathrm{Var}(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 S_i^2 / n_i. \qquad (5.3)$$

(b) Proportional allocation, $n_i = nW_i$, $f_i = f = n/N$,

$$\mathrm{Var}(\bar{y}_{st}) = \frac{1-f}{n} \sum_{i=1}^{k} W_i S_i^2. \qquad (5.4)$$

(c) Proportional allocation, and constant within-stratum variances,
$S_i^2 = S_w^2$ $(i = 1, \ldots, k)$,

$$\mathrm{Var}(\bar{y}_{st}) = (1-f) S_w^2 / n. \qquad (5.5)$$

Directly analogous results hold for the estimation of the population total,
$Y_T$. Thus, from the stratified random sample we obtain an unbiased estimator
$N\bar{y}_{st} = \sum_{i=1}^{k} N_i \bar{y}_i$, with variance
$\sum_{i=1}^{k} N_i^2 (1 - f_i) S_i^2 / n_i$.

In practical situations the stratum variances, $S_i^2$, will not be known. So if
we wish to quote a standard error for the estimator $\bar{y}_{st}$, or to
construct an approximate confidence interval for $\bar{Y}$ based on
$\bar{y}_{st}$ (and involving its approximate normality in large samples), we
will need to estimate the $S_i^2$. We have already considered this problem in
the discussion (Section 2.4) of how to obtain an unbiased estimator of a
population variance from an s.r. sample. The strata are just sub-populations
and the sampled values in each stratum constitute an s.r. sample.

Thus, as

$$s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \qquad (i = 1, 2, \ldots, k)$$

are (unbiased) estimators of the stratum variances $S_i^2$
$(i = 1, 2, \ldots, k)$, we obtain an unbiased estimator of
$\mathrm{Var}(\bar{y}_{st})$ as

$$s^2(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 (1 - f_i) s_i^2 / n_i
  = \frac{1}{N^2} \sum_{i=1}^{k} N_i (N_i - n_i) s_i^2 / n_i. \qquad (5.6)$$

Naturally this requires that all stratum sample sizes should be at least 2, i.e.
$n_i \ge 2$ $(i = 1, 2, \ldots, k)$.
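
As an informal illustration, the stratified sample mean and the variance
estimator (5.6) can be computed in a few lines of Python. This is only a
sketch: the function name `stratified_estimate`, the layout of its arguments,
and the data are assumptions made for the example, not part of the text.

```python
from statistics import mean, variance


def stratified_estimate(samples, stratum_sizes):
    """Return (y_st, s2) where s2 is the estimator (5.6) of Var(y_st).

    samples       -- list of lists; samples[i] holds the values drawn from stratum i
    stratum_sizes -- list of the known stratum sizes N_i
    """
    N = sum(stratum_sizes)
    y_st = 0.0
    s2 = 0.0
    for y_i, N_i in zip(samples, stratum_sizes):
        n_i = len(y_i)
        if n_i < 2:
            # s_i^2 is undefined for a single observation
            raise ValueError("each stratum needs at least 2 observations")
        W_i = N_i / N          # stratum weight
        f_i = n_i / N_i        # stratum sampling fraction
        y_st += W_i * mean(y_i)
        s2 += W_i ** 2 * (1 - f_i) * variance(y_i) / n_i
    return y_st, s2


# Hypothetical data: three strata of sizes 50, 30 and 20.
samples = [[10.0, 12.0, 11.0, 13.0], [20.0, 22.0, 24.0], [30.0, 34.0]]
y_st, s2 = stratified_estimate(samples, [50, 30, 20])
print(y_st, s2)
```

`statistics.variance` uses the divisor $n_i - 1$, which matches the definition
of $s_i^2$ above.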

It is not entirely unreasonable that occasionally we may encounter some
$n_i = 1$, typically when the population is highly variable and a very large
number of strata need to be considered. Estimation of
$\mathrm{Var}(\bar{y}_{st})$ is not hopeless in such situations: two ingenious
methods for dealing with the extreme case where all $n_i = 1$ are described
briefly by Cochran (1977, Section 5A.12).

In some situations the practical circumstances may suggest that all stratum
variances are equal. If this is so, then it is desirable to combine the data from
the different strata to obtain an overall, or 'pooled', unbiased estimator of the
common variance $S_w^2$ in the form

$$s_w^2 = \frac{1}{n - k} \sum_{i=1}^{k} (n_i - 1) s_i^2.$$

We can now estimate $\mathrm{Var}(\bar{y}_{st})$ by

$$s^2(\bar{y}_{st}) = \frac{s_w^2}{N^2} \sum_{i=1}^{k} N_i (N_i - n_i) / n_i.$$

In such a situation we will frequently strive (for reasons made clear later)
to use proportional allocation in drawing the sample, and we then have simply

$$s^2(\bar{y}_{st}) = \Bigl(1 - \frac{n}{N}\Bigr) s_w^2 / n$$

as an unbiased estimator of $\mathrm{Var}(\bar{y}_{st})$.

Approximate confidence intervals for $\bar{Y}$ (or $Y_T$) may again be
constructed on the assumption of a normal sampling distribution for the
estimator, $\bar{y}_{st}$. These will be given (for confidence level
$1 - \alpha$) by

$$\bar{y}_{st} - z_\alpha s(\bar{y}_{st}) < \bar{Y} < \bar{y}_{st} + z_\alpha s(\bar{y}_{st})$$

or

$$N[\bar{y}_{st} - z_\alpha s(\bar{y}_{st})] < Y_T < N[\bar{y}_{st} + z_\alpha s(\bar{y}_{st})].$$

As before, such approximations will only be reasonable if the conditions
are satisfied (in terms of sample size, and so on) for $\bar{y}_{st}$ to have a
distribution which is essentially normal, and for $s^2(\bar{y}_{st})$ to be
close in value to $\mathrm{Var}(\bar{y}_{st})$. The latter requirement is often
the more stringent one, and replacement of $z_\alpha$ by a percentage point for
an appropriate $t$-distribution can lead to greater accuracy. But in stratified
random sampling, the situation is far less clear-cut than in s.r. sampling from
the total population, and no general prescription for the construction of
approximate confidence intervals is available. One difficulty is that although
total sample size may be large, the s.r. samples within each stratum will
frequently not be. Some work has been done on what constitutes an 'appropriate'
number of degrees of freedom to adopt when using a $t$-distribution in place of
the normal distribution, but it can hardly be claimed to lead to any
universally applicable policy.


Estimating a proportion. Suppose a proportion $P$ of the overall stratified
population possesses a particular attribute. The estimation of $P$ from a
stratified simple random sample presents no problems. Each stratum sample member
can now be thought of as having a value $x_{ij}$
$(i = 1, \ldots, k;\ j = 1, \ldots, n_i)$ which is 1 or 0 depending on whether
or not the sample individual possesses the defined attribute.

Clearly $P = \sum_{i=1}^{k} \sum_{j=1}^{N_i} X_{ij} / N = \bar{X}$ (the
population mean X-value) and the stratified sample mean

$$\bar{x}_{st} = \sum_{i=1}^{k} W_i \bar{x}_i$$

will, on the argument above, provide an unbiased estimator of $P$. Thus we
define an estimator of $P$ in the form of a stratified sample proportion,
$p_{st}$, by rewriting the above expression for $\bar{x}_{st}$ in the form

$$p_{st} = \sum_{i=1}^{k} W_i p_i$$

where $p_i$ is the sampled proportion (just $\bar{x}_i$) in the $i$th stratum.


If the actual proportions in the population strata are $P_i$
$(i = 1, \ldots, k)$, then the variance of $p_{st}$ is just

$$\mathrm{Var}(p_{st}) = \sum_{i=1}^{k} W_i^2 (N_i - n_i) P_i (1 - P_i) / [(N_i - 1) n_i]$$

which, ignoring terms in $1/N_i$, can be rewritten in the simpler form

$$\mathrm{Var}(p_{st}) = \sum_{i=1}^{k} W_i^2 (1 - f_i) P_i (1 - P_i) / n_i \qquad (5.7)$$

where $f_i$ is the sampling fraction $n_i/N_i$ in the $i$th stratum
$(i = 1, \ldots, k)$. An unbiased estimator of the variance (5.7) is given by

$$\sum_{i=1}^{k} W_i^2 (1 - f_i) p_i (1 - p_i) / (n_i - 1) \qquad \text{(see Section 2.10)}.$$
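
The stratified sample proportion and the variance estimator just quoted can be
sketched in the same way. The function name `stratified_proportion` and the
counts used below are hypothetical, chosen only to show the arithmetic.

```python
def stratified_proportion(counts, sample_sizes, stratum_sizes):
    """Return (p_st, var_hat) for a stratified simple random sample.

    counts[i]        -- number of sampled members of stratum i with the attribute
    sample_sizes[i]  -- stratum sample size n_i (at least 2)
    stratum_sizes[i] -- known stratum size N_i
    """
    N = sum(stratum_sizes)
    p_st = 0.0
    var_hat = 0.0
    for a_i, n_i, N_i in zip(counts, sample_sizes, stratum_sizes):
        W_i = N_i / N
        f_i = n_i / N_i
        p_i = a_i / n_i                      # sampled proportion in stratum i
        p_st += W_i * p_i
        var_hat += W_i ** 2 * (1 - f_i) * p_i * (1 - p_i) / (n_i - 1)
    return p_st, var_hat


# Hypothetical survey: 12 of 40 sampled from a stratum of 600,
# and 9 of 20 sampled from a stratum of 400.
p_st, var_hat = stratified_proportion([12, 9], [40, 20], [600, 400])
print(p_st, var_hat)
```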


5.2 Comparison of the simple random sample mean 
and the stratified sample mean 


In the introduction to this chapter it was suggested that stratification of the
population may on occasions increase the efficiency with which we can estimate
population characteristics such as $\bar{Y}$ or $Y_T$. To examine this
possibility let us compare $\bar{y}$ and $\bar{y}_{st}$ in the same situation.
Both are unbiased estimators. Which is the more efficient, in the sense of
having the smaller variance? We know that

$$\mathrm{Var}(\bar{y}) = (1 - f) S^2 / n,$$

whilst $\mathrm{Var}(\bar{y}_{st})$ is given by (5.2). To simplify the
comparison we shall suppose that the stratified sample has been drawn with
proportional allocation. Then, from (5.4),

$$\mathrm{Var}(\bar{y}_{st}) = \frac{1-f}{n} \sum_{i=1}^{k} \frac{N_i}{N} S_i^2$$

and

$$\mathrm{Var}(\bar{y}) - \mathrm{Var}(\bar{y}_{st}) = \frac{1-f}{n} \Bigl( S^2 - \sum_{i=1}^{k} \frac{N_i}{N} S_i^2 \Bigr).$$

But, by (5.1),

$$S^2 = \frac{1}{N-1} \Bigl[ \sum_{i=1}^{k} (N_i - 1) S_i^2 + \sum_{i=1}^{k} N_i (\bar{Y}_i - \bar{Y})^2 \Bigr].$$

Now if the stratum sizes, $N_i$, are large enough,

$$\frac{N_i - 1}{N - 1} \approx \frac{N_i}{N} \qquad (5.8)$$

and

$$\frac{1}{N - 1} \approx \frac{1}{N},$$

so that, using (5.1),

$$\mathrm{Var}(\bar{y}) - \mathrm{Var}(\bar{y}_{st}) \approx \frac{1-f}{n} \sum_{i=1}^{k} \frac{N_i}{N} (\bar{Y}_i - \bar{Y})^2, \qquad (5.9)$$

which is positive unless, exceptionally, the $\bar{Y}_i$ are all the same. Thus
it appears that the stratified sample mean will always be more efficient than
the s.r. sample mean, the more so the larger the variation in the stratum means.

But the assumption (5.8) proves to be quite crucial. Suppose the stratum sizes
are not large enough for (5.8) to be a reasonable approximation. Then, using
(5.1) for $S^2$, we obtain the more accurate expression

$$\mathrm{Var}(\bar{y}) - \mathrm{Var}(\bar{y}_{st}) = \frac{1-f}{n(N-1)}
  \Bigl[ \sum_{i=1}^{k} N_i (\bar{Y}_i - \bar{Y})^2 - \frac{1}{N} \sum_{i=1}^{k} (N - N_i) S_i^2 \Bigr], \qquad (5.10)$$

which need not necessarily be positive. This more precise comparison shows
that $\bar{y}_{st}$ is not necessarily more efficient than $\bar{y}$ under all
circumstances; $\bar{y}_{st}$ will be more efficient than $\bar{y}$ if

$$\sum_{i=1}^{k} N_i (\bar{Y}_i - \bar{Y})^2 > \frac{1}{N} \sum_{i=1}^{k} (N - N_i) S_i^2.$$

A further specialisation gives a more easily interpretable form for this
condition. Suppose all the strata have the same variance, $S_w^2$. Then we
require

$$\frac{1}{k-1} \sum_{i=1}^{k} N_i (\bar{Y}_i - \bar{Y})^2 > S_w^2. \qquad (5.11)$$

Thus, the stratified sample mean will be more efficient than the s.r. sample
mean if variation between the stratum means is sufficiently large compared with
within-strata variation; the greater this advantage the greater the efficiency of
$\bar{y}_{st}$ relative to $\bar{y}$. (Note how this comparison mirrors the
analysis of variance criterion in infinite normal samples for testing
homogeneity of a set of means.)

Summarising the results of this section we can informally conclude that the
higher the variability in stratum means, and the lower the accumulated
variability of within-stratum Y-values over all the strata, the greater is the
potential gain from using the stratified sample mean $\bar{y}_{st}$ (rather
than $\bar{y}$) for estimating $\bar{Y}$. The same will be true, of course, for
the estimation of $Y_T$.
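
The decomposition (5.1) and the exact difference (5.10) are easy to check
numerically. The sketch below does so for a small hypothetical population of
three strata (the values and the variable names are assumptions for this
illustration); proportional allocation with $n = 5$ is assumed, as in the
comparison above.

```python
from statistics import mean, variance

# Hypothetical stratified population (values chosen only for illustration).
strata = [[2.0, 4.0, 6.0, 8.0], [10.0, 12.0, 14.0, 16.0], [20.0, 24.0]]
n = 5  # total sample size; proportional allocation gives n_i = 2, 2, 1

population = [y for s in strata for y in s]
N = len(population)
f = n / N
Y_bar = mean(population)
S2 = variance(population)  # divisor N - 1, as in the text

# The two components of (5.1).
within = sum((len(s) - 1) * variance(s) for s in strata)
between = sum(len(s) * (mean(s) - Y_bar) ** 2 for s in strata)

var_srs = (1 - f) * S2 / n                                             # s.r. sample mean
var_st = (1 - f) / n * sum(len(s) / N * variance(s) for s in strata)   # (5.4)

# Right-hand side of (5.10).
rhs_510 = (1 - f) / (n * (N - 1)) * (
    between - sum((N - len(s)) * variance(s) for s in strata) / N
)

print(var_srs, var_st, var_srs - var_st, rhs_510)
```

Here the stratum means differ widely relative to the within-stratum spread, so
the difference is positive and stratification pays.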


Example 5.1 


It is interesting to see if we can illustrate the properties of $\bar{y}_{st}$,
which are discussed above, by stratifying the population of heights given
in the Statistics Class and studying corresponding stratified sample
means. There are certain obvious, if rather artificial, methods of
stratification in this situation, namely
(i) by rows,
(ii) by columns,
(iii) by sex.

Thus in (i) and (ii) we have 5 strata of equal size, each stratum
consisting of the 5 students sitting in some particular row, or column,
respectively. (See Figure 1.1, where the row-strata are labelled
I, ..., V; the column strata A, B, C, D, E.) In (iii) we have two
strata, male and female students, with 15 and 10 members respectively.
Suppose we take a stratified random sample of size 5, with
proportional allocation, in each case. That is, in
(i) we pick one student at random from each row,
(ii) we pick one student at random from each column,
(iii) we take an s.r. sample of 3 male, and an s.r. sample of 2
female, students.

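To compare the sampling distributions of $\bar{y}_{st}$ (in each case) with
the earlier derived empirical sampling distributions for $\bar{y}$, and for
$\bar{y}_R$, should prove interesting. Relevant population characteristics are

(i) by rows:
    $\bar{Y}_{I} = 11.0$, $\bar{Y}_{II} = 21.0$, $\bar{Y}_{III} = 25.6$, $\bar{Y}_{IV} = 32.4$, $\bar{Y}_{V} = 29.4$
    $S_{I}^2 = 22.0$, $S_{II}^2 = 7.4$, $S_{III}^2 = 14.28$, $S_{IV}^2 = 39.3$, $S_{V}^2 = 49.3$

(ii) by columns:
    $\bar{Y}_{A} = 25.2$, $\bar{Y}_{B} = 22.8$, $\bar{Y}_{C} = 20.8$, $\bar{Y}_{D} = 23.8$, $\bar{Y}_{E} = 26.8$
    $S_{A}^2 = 68.2$, $S_{B}^2 = 95.2$, $S_{C}^2 = 39.7$, $S_{D}^2 = 161.2$, $S_{E}^2 = 92.2$

(iii) by sex:
    $\bar{Y}_{M} = 28.9$, $\bar{Y}_{F} = 16.4$
    $S_{M}^2 = 42.0$, $S_{F}^2 = 45.6$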


Five hundred stratified random samples of size 5 have been chosen in each
case, and Figures 5.1, 5.2 and 5.3 present the three histograms of values of
$\bar{y}_{st}$ so obtained. The approximate variances of $\bar{y}_{st}$,
estimated from the 500 values of $\bar{y}_{st}$ in each case, are

(i) 4.36 (4.19)
(ii) 12.99 (14.52)
(iii) 6.86 (6.92)


The values in brackets are those obtained directly from the theoretical form 
(5.4) using the appropriate population characteristics. 
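
A sampling experiment of this kind is simple to reproduce in outline. The
sketch below does not use the Statistics Class data; it builds a hypothetical
population of three equal strata, draws repeated stratified random samples, and
compares the empirical variance of $\bar{y}_{st}$ with the theoretical value
from (5.2). The function names, the seed and the population are all assumptions
made for the illustration.

```python
import random
from statistics import mean, variance


def stratified_sample_mean(strata, sizes, rng):
    """Draw an s.r. sample of size sizes[i] from strata[i]; return y_st."""
    N = sum(len(s) for s in strata)
    return sum(len(s) / N * mean(rng.sample(s, n_i))
               for s, n_i in zip(strata, sizes))


def theoretical_variance(strata, sizes):
    """Var(y_st) from (5.2), using the known stratum variances S_i^2."""
    N = sum(len(s) for s in strata)
    return sum((len(s) / N) ** 2 * (1 - n_i / len(s)) * variance(s) / n_i
               for s, n_i in zip(strata, sizes))


rng = random.Random(1)  # fixed seed so that the run is repeatable
# Hypothetical population: 3 strata of 20 values with well-separated means.
strata = [[rng.gauss(10.0 * (i + 1), 3.0) for _ in range(20)] for i in range(3)]
sizes = [2, 2, 2]  # proportional allocation, since the strata are equal in size

draws = [stratified_sample_mean(strata, sizes, rng) for _ in range(4000)]
population_mean = mean(y for s in strata for y in s)
empirical = variance(draws)
theory = theoretical_variance(strata, sizes)
print(f"mean of draws {mean(draws):.3f}, population mean {population_mean:.3f}")
print(f"empirical variance {empirical:.3f}, value from (5.2) {theory:.3f}")
```

With a few thousand repetitions the two variances should agree to within a few
per cent, in the same way that the bracketed and unbracketed figures above are
close but not identical.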

We see that stratification by rows produces a dramatic increase in efficiency
over the s.r. sample mean, $\bar{y}$, whose variance is 12.87. The efficiency
of $\bar{y}_{st}$ is of similar order to that of the ratio estimator,
$\bar{y}_R$, which has estimated variance 4.45. Likewise, stratification by sex
provides an improvement over $\bar{y}$, but is not as good as $\bar{y}_R$.
Stratification by columns clearly confers no advantage; indeed we will do
better even to use the s.r. sample mean, $\bar{y}$.

These results reflect what we would expect on theoretical grounds. For
stratification by rows (and to a lesser extent by sex) we see that there are
substantial differences in strata means, but relatively little variation within
strata: precisely the conditions for a useful gain in efficiency from using
stratified sampling. Thus $\bar{y}_{st}$ appears to be more efficient than
$\bar{y}$. For stratification by columns the reverse is true, so (as we would
expect) $\bar{y}_{st}$ has little merit.

Fig. 5.1. Histogram of 500 values of $\bar{y}_{st}$ (rows) for samples of size 5 in the Statistics Class

Fig. 5.2. Histogram of 500 values of $\bar{y}_{st}$ (columns) for samples of size 5 in the Statistics Class

What is puzzling is precisely why stratification by row, and by column,
should produce such distinctly different effects in this real-life population. It
is far more marked than might be anticipated on intuitive grounds; nonetheless
it is genuine, and is a fine illustration of the range of possibilities in
stratified sampling. Can you think of an explanation?

Fig. 5.3. Histogram of 500 values of $\bar{y}_{st}$ (sex) for samples of size 5 in the Statistics Class


5.3 Optimum choice of stratum sample sizes 


So far it has been assumed that the total sample size $n$, and the stratum
sample sizes $n_i$, have been prescribed. The study of the properties of
$\bar{y}_{st}$ has assumed some particular allocation of the stratum sample
sizes $n_1, n_2, \ldots, n_k$. As in all survey sampling schemes, we must
remain aware of the need to achieve some required precision of estimation, and
to do so either for minimum cost or (if possible) within some cost limitation
imposed by the resources available for conducting the enquiry. Such factors are
no less important in stratified sampling than for other sampling schemes. They
may be even more important.

Often we have only limited choice of the basis of stratification. The major 
determinant is frequently an administrative one: that different methods need 
to be used to sample different sections of the population, or that natural 
(financial, geographic, social) divisions exist across which complete randomisa- 
tion is clumsy, and unnecessary. For example, in a sociological study, lack of 
a common listing, differential problems of access and communication, desire 
for representative coverage of the population, and so on, could make it 
undesirable to attempt to sample at random the whole population (of hospital 
patients, prisoners, old-age pensioners, etc.). 

Intrinsic interest in the sub-populations themselves also encourages
stratification, a point we shall return to later. Then again, in a national
geographic survey, it is likely to be most convenient to sample different regions
separately; costs of sampling can also vary from region to region (stratum to
stratum) if only with respect to travelling expenses for survey workers.

So, again, we must consider the question of how to choose the sample size
$n$ to satisfy certain precision or cost requirements. Since different strata
are likely to exhibit different degrees of variability, we must inevitably
proceed beyond the choice of $n$ to the allocation of the individual stratum
sample sizes, $n_i$. Also, to merely state our requirements in terms of the
variance of some estimator will not be sufficient in general. Different
sampling costs for different strata imply that we must attempt to take some
account of cost factors in determining a desirable allocation of stratum
sample sizes.

In any particular problem, local circumstances should enable a reasonably
precise statement to be made of sampling costs in the different strata. Simple
cost models have been proposed within which a large number of practical
problems can be accommodated. The simplest form assumes that there is some
overhead cost, $c_0$, of administering the survey, and that individual
observations from the $i$th stratum each cost an amount $c_i$. Thus the total
cost is

$$C = c_0 + \sum_{i=1}^{k} c_i n_i. \qquad (5.12)$$

This is the model which we shall adopt, although constant unit cost for
observations sometimes exaggerates the true situation, and the replacement
of $\sum_{i=1}^{k} c_i n_i$ by, say, $\sum_{i=1}^{k} d_i \sqrt{n_i}$ may be
better. This latter form is more reasonable, for example, when the major cost
ingredient is for travel.

Suppose we adopt the cost model (5.12) and ask what allocation of stratum
sample sizes, $n_1, n_2, \ldots, n_k$, should be adopted to

(I) minimise $\mathrm{Var}(\bar{y}_{st})$ for a given total cost $C$,

(II) minimise the total cost $C$, for a given value of $\mathrm{Var}(\bar{y}_{st})$.

We shall consider these cases separately.


I. Minimum variance for fixed cost

We must choose $n_1, n_2, \ldots, n_k$ to minimise

$$\mathrm{Var}(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 S_i^2 / n_i - \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2 \qquad \text{(see (5.2))}$$

subject to the constraint

$$\sum_{i=1}^{k} c_i n_i = C - c_0.$$

Introducing a Lagrangian multiplier*, $\lambda$, we will need

$$-\frac{W_i^2 S_i^2}{n_i^2} + \lambda c_i = 0 \qquad (i = 1, \ldots, k)$$

or

$$n_i \sqrt{\lambda} = W_i S_i / \sqrt{c_i}.$$

* Readers unfamiliar with this technique may either accept the results (5.13)
and (5.14) on trust or may demonstrate them by the (lengthier) type of
substitution argument used at the end of Section 2.2 (page 26).

Multiplying each side by $c_i$ and summing over the strata gives

$$(C - c_0) \sqrt{\lambda} = \sum_{i=1}^{k} W_i S_i \sqrt{c_i}.$$

Thus, the optimum allocation for fixed total cost is given by

$$n_i = \frac{(C - c_0) \, W_i S_i / \sqrt{c_i}}{\sum_{j=1}^{k} W_j S_j \sqrt{c_j}} \qquad (5.13)$$

and the total sample size will be

$$n = \frac{(C - c_0) \sum_{i=1}^{k} W_i S_i / \sqrt{c_i}}{\sum_{j=1}^{k} W_j S_j \sqrt{c_j}}. \qquad (5.14)$$

We see, then, that the stratum sample sizes will need to be proportional to 
the stratum sizes, proportional to the stratum standard deviations, and inversely 
proportional to the square root of the unit sampling costs in each stratum. 
(Large, highly variable strata with low unit sampling costs will lead to large 
samples relative to those from other strata.) 
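
The allocation (5.13) can be sketched directly. In the Python fragment below
the function names and all numerical inputs are hypothetical; the sample sizes
are left as real numbers (no rounding), and a rival allocation with the same
total cost is included only to show that it yields a larger value of the
allocation-dependent part of the variance.

```python
from math import sqrt


def optimum_allocation(W, S, c, C, c0):
    """Real-valued stratum sample sizes n_i from (5.13)."""
    denom = sum(W_i * S_i * sqrt(c_i) for W_i, S_i, c_i in zip(W, S, c))
    return [(C - c0) * W_i * S_i / sqrt(c_i) / denom
            for W_i, S_i, c_i in zip(W, S, c)]


def variance_term(W, S, n_i):
    """Allocation-dependent part of Var(y_st): sum of W_i^2 S_i^2 / n_i."""
    return sum((W_i * S_i) ** 2 / m for W_i, S_i, m in zip(W, S, n_i))


# Hypothetical weights, standard deviations, unit costs and budget.
W = [0.5, 0.3, 0.2]
S = [2.0, 4.0, 5.0]
c = [1.0, 4.0, 4.0]
C, c0 = 100.0, 20.0

n_opt = optimum_allocation(W, S, c, C, c0)
spend = sum(c_i * m for c_i, m in zip(c, n_opt))

# A rival allocation costing the same: equal spend in each stratum.
n_alt = [(C - c0) / 3 / c_i for c_i in c]

print(n_opt, spend)
print(variance_term(W, S, n_opt), variance_term(W, S, n_alt))
```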

A special case of these results arises when the unit sampling costs $c_i$ are
the same in all strata, so that

$$C = c_0 + nc,$$

where $c$ is the (constant) unit sampling cost. The optimum allocation now needs

$$n_i = n \, \frac{W_i S_i}{\sum_{j=1}^{k} W_j S_j} \qquad (5.15)$$

with

$$n = (C - c_0)/c. \qquad (5.16)$$

Clearly the allocation (5.15) can equivalently be regarded as optimum for
fixed sample size, ignoring variation in costs of sampling from one stratum to
another, in the sense that, given $n$, it minimises $\mathrm{Var}(\bar{y}_{st})$.

The allocation (5.15) is given a special name: it is called Neyman allocation,
after J. Neyman who gave an early proof of its optimality. The resulting
minimum variance (for Neyman allocation; that is either for fixed sample size
ignoring sampling costs, or with a cost limit and constant unit sampling costs)
is obtained by substituting (5.15) in (5.2) to obtain

$$\mathrm{Var}_{\min}(\bar{y}_{st}) = \frac{1}{n} \Bigl( \sum_{i=1}^{k} W_i S_i \Bigr)^2 - \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2, \qquad (5.17)$$

the second term arising from the f.p.c. In the cost-limited case, with constant
unit sampling costs, $n$ will need to be determined from (5.16).


II. Minimum cost for fixed variance

Suppose that, instead of putting a limit on total cost, we fix
$\mathrm{Var}(\bar{y}_{st})$, perhaps by imposing some precision requirement in
the form that $\bar{y}_{st}$ should have a certain (high) probability of not
differing from $\bar{Y}$ in absolute value by more than a specified amount. We
would now like to satisfy (for a prescribed value of $V$) a condition

$$\mathrm{Var}(\bar{y}_{st}) = V$$

for the minimum possible total cost. The appropriate allocation of stratum (and
total) sample sizes is immediately determined from a reinterpretation of the
results under I. We can see this informally from the following argument. We
know that for any specified total cost, $\mathrm{Var}(\bar{y}_{st})$ is
minimised when the $n_i$ are chosen to be proportional to $W_i S_i/\sqrt{c_i}$.
For given $V$ there will be some total cost $C$ for which this allocation
yields $V$ as the minimum variance. If $C$ were made larger, then, in view of
the explicit form of $n_i$ given by (5.13), the minimised variance would become
less than $V$; this we do not require. If $C$ were smaller, by a similar
argument we see that no allocation could restrict $\mathrm{Var}(\bar{y}_{st})$
to as small a value as $V$. Thus choice of $n_i$ proportional to
$W_i S_i/\sqrt{c_i}$ must also minimise total cost for a given value of
$\mathrm{Var}(\bar{y}_{st})$.

Specifically we need

$$n_i = \kappa \, W_i S_i / \sqrt{c_i},$$

where the constant $\kappa$ must be chosen to ensure that

$$\mathrm{Var}(\bar{y}_{st}) = \sum_{i=1}^{k} W_i^2 S_i^2 / n_i - \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2 = V.$$

Hence we must take

$$\kappa = \frac{\sum_{i=1}^{k} W_i S_i \sqrt{c_i}}{V + \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2} \qquad (5.18)$$

and the total sample size $n$ will be

$$n = \frac{\bigl( \sum_{i=1}^{k} W_i S_i \sqrt{c_i} \bigr) \bigl( \sum_{i=1}^{k} W_i S_i / \sqrt{c_i} \bigr)}{V + \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2}.$$

Again, if unit sampling costs are constant at some value $c$, we see that
Neyman allocation (5.15) is optimum in the sense of minimising total sample
size (since this is equivalent to minimising total cost) for a given
$\mathrm{Var}(\bar{y}_{st}) = V$. The resulting minimum total sample size will be

$$\Bigl( \sum_{i=1}^{k} W_i S_i \Bigr)^2 \Big/ \Bigl( V + \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2 \Bigr).$$

One more situation warrants investigation. Optimum allocation may not be
feasible; suppose instead that we must operate with prescribed sampling weights,
$w_i = n_i/n$, for the different strata. It is nonetheless useful to examine how
to choose the total sample size to achieve a specified value for
$\mathrm{Var}(\bar{y}_{st})$.


III. Sample size needed to yield some specified $\mathrm{Var}(\bar{y}_{st})$ with given
sampling weights

Suppose we want $\mathrm{Var}(\bar{y}_{st}) = V$. For example, we might specify
a margin of error, $d$, and acceptable probability of error, $\alpha$, in the
sense that we need

$$\Pr\{ |\bar{y}_{st} - \bar{Y}| > d \} \le \alpha.$$

Proceeding as in Section 2.6, using an assumed normal distribution for
$\bar{y}_{st}$ (but see Section 5.4), this would require

$$V = (d/z_\alpha)^2.$$

Equating $\mathrm{Var}(\bar{y}_{st})$ to $V$ gives

$$n = \sum_{i=1}^{k} (W_i^2 S_i^2 / w_i) \Big/ \Bigl( V + \frac{1}{N} \sum_{i=1}^{k} W_i S_i^2 \Bigr).$$

Thus, as a first approximation to the required sample size, we have

$$n_0 = \frac{1}{V} \sum_{i=1}^{k} W_i^2 S_i^2 / w_i,$$

or more accurately

$$n = n_0 \Bigl( 1 + \frac{1}{NV} \sum_{i=1}^{k} W_i S_i^2 \Bigr)^{-1}.$$

In the special cases of proportional allocation, and of Neyman allocation, we
have

$$n_0 = \frac{1}{V} \sum_{i=1}^{k} W_i S_i^2, \qquad n = n_0 \Bigl( 1 + \frac{n_0}{N} \Bigr)^{-1},$$

and

$$n_0 = \frac{1}{V} \Bigl( \sum_{i=1}^{k} W_i S_i \Bigr)^2, \qquad n = n_0 \Bigl( 1 + \frac{1}{NV} \sum_{i=1}^{k} W_i S_i^2 \Bigr)^{-1},$$

respectively.
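
The sample size calculation under III can be sketched as follows. The function
name `required_sample_size` and the numbers are hypothetical; the final lines
simply substitute the resulting $n$ back into the variance formula to confirm
that the target $V$ is met.

```python
def required_sample_size(W, S2, w, V, N):
    """Return (n0, n) for prescribed sampling weights w_i = n_i / n.

    W  -- stratum weights W_i = N_i / N
    S2 -- stratum variances S_i^2
    w  -- prescribed sampling weights (summing to 1)
    V  -- required value of Var(y_st)
    N  -- population size
    """
    n0 = sum(W_i ** 2 * s2 / w_i for W_i, s2, w_i in zip(W, S2, w)) / V
    n = n0 / (1 + sum(W_i * s2 for W_i, s2 in zip(W, S2)) / (N * V))
    return n0, n


# Hypothetical two-stratum population.
W = [0.5, 0.5]
S2 = [4.0, 16.0]
w = [0.25, 0.75]
V, N = 0.1, 1000

n0, n = required_sample_size(W, S2, w, V, N)

# Check: with n_i = n * w_i the variance should equal V.
achieved = (sum(W_i ** 2 * s2 / (n * w_i) for W_i, s2, w_i in zip(W, S2, w))
            - sum(W_i * s2 for W_i, s2 in zip(W, S2)) / N)
print(n0, n, achieved)
```

In practice $n$ and the $n_i$ would then be rounded up to whole numbers.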


5.4 Comparison of proportional allocation and 
optimum allocation 


A crucial consideration in studying the efficiency of the stratified sample mean 
is to ask to what extent optimum allocation of sample units to the different 
strata is better than proportional allocation. 

Proportional allocation is straightforward; it requires no knowledge of 
stratum variances or relative sampling costs. This type of knowledge is required 
for optimum allocation in the sense of the previous section. Such knowledge 
may not be available, or only known imprecisely. Its acquisition may require 
fairly detailed preliminary enquiries, or the acceptance of certain assumptions 
about the population structure that are difficult to justify. 

Before embarking on such enquiries, formulating such assumptions, or 
considering the implications of an imprecise statement of sampling costs or 
stratum variances, we should have some idea of what potential gain can arise 
from optimum, rather than proportional, allocation. 

We shall consider just one case, the comparison of proportional allocation
and Neyman allocation (optimum for constant unit sampling costs in different
strata).

Denoting $\mathrm{Var}(\bar{y}_{st})$ by $V_P$ and $V_N$, for proportional and
Neyman allocation, respectively, it must be true that $V_P \ge V_N$. More
specifically, from (5.4) and (5.17), we have

$$V_P - V_N = \frac{1}{n} \sum_{i=1}^{k} W_i (S_i - \bar{S})^2,$$

where $\bar{S} = \sum_{i=1}^{k} W_i S_i$. Thus the extent of the potential gain
from optimum (Neyman) allocation compared with proportional allocation depends
on the variability of the stratum variances: the larger this is, the greater
the relative advantage of optimum allocation.
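
The identity for $V_P - V_N$ is readily confirmed numerically. The following
sketch evaluates (5.4) and (5.17) for a hypothetical set of stratum weights and
standard deviations (all values and function names are assumptions for the
illustration) and compares their difference with the right-hand side above.

```python
def var_proportional(W, S, n, N):
    """Var(y_st) under proportional allocation, from (5.4)."""
    return (1 / n - 1 / N) * sum(W_i * S_i ** 2 for W_i, S_i in zip(W, S))


def var_neyman(W, S, n, N):
    """Var(y_st) under Neyman allocation, from (5.17)."""
    S_bar = sum(W_i * S_i for W_i, S_i in zip(W, S))
    return S_bar ** 2 / n - sum(W_i * S_i ** 2 for W_i, S_i in zip(W, S)) / N


# Hypothetical stratum weights and standard deviations.
W = [0.5, 0.3, 0.2]
S = [1.0, 2.0, 4.0]
n, N = 50, 1000

V_P = var_proportional(W, S, n, N)
V_N = var_neyman(W, S, n, N)
S_bar = sum(W_i * S_i for W_i, S_i in zip(W, S))
gain = sum(W_i * (S_i - S_bar) ** 2 for W_i, S_i in zip(W, S)) / n
print(V_P, V_N, V_P - V_N, gain)
```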


Example 5.2 


Consider once again the situation described in Example 2.4 where
we wish to estimate the total Christmas Card sales for a network
of 243 stationery shops, by asking some of the shops to submit
'early returns' at the end of January. Suppose that for general
accounting purposes the shops have been divided into three groups
on the basis of their average annual turnover for all products over
the period of the previous five years. The finite population is thus
stratified with respect to annual turnover. The full July returns of
the Christmas Card sales over recent years enable us to make fairly
precise statements of the three stratum variances. It is reasonable
to expect that sampling costs will be higher for the larger shops
and this turns out to be so. The strata sizes, variances, and sampling
costs (in appropriate units) are as follows:

Average turnover (£000)    $N_i$    $S_i^2$    $c_i$
Less than 50                146      0.16       2
Between 50 and 100           62      0.58       3
Greater than 100             35      0.31       4

Suppose we again want to estimate total sales for the current
year, in such a way that we have a 95% chance of our estimate
being within 10% of the true figure. We find that this requires (again
assuming that total sales will be in the region of 420, and using the
normal approximation) that we restrict the variance of our stratified
sample estimator of total sales to about 460. Equivalently, we need
$\mathrm{Var}(\bar{y}_{st}) = V = (0.0882)^2$.

With proportional allocation we need sample stratum weights
0.601, 0.255 and 0.144, respectively. Now

$$\frac{1}{V} \sum_{i=1}^{k} W_i S_i^2 = 37.126,$$

and

$$37.126 \, (1 + 37.126/243)^{-1} = 32.206,$$

so that allowing for the fact that sample sizes must be whole
numbers, we will need to take 34 observations: with 20, 9 and 5
observations, respectively, drawn at random from the three strata.

Taking the calculations a stage further we might seek the optimum 
allocation. The sample weights now need to be about 0.527, 0.348, 
and 0.124, respectively. The required total sample size is now 31, 
consisting of 16, 11, and 4 observations from the different strata. 
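
The arithmetic of this example can be retraced with a short Python sketch,
using the tabulated stratum sizes, variances and costs and the formulae of
Section 5.3. Two assumptions are made here that the text does not state
explicitly: each stratum sample size is rounded up to the next whole number,
and unrounded weights are carried through the calculation, so intermediate
figures may differ from those quoted in the last decimal place or so.

```python
from math import ceil, sqrt

N_i = [146, 62, 35]        # stratum sizes from the table
S2 = [0.16, 0.58, 0.31]    # stratum variances from the table
c = [2, 3, 4]              # unit sampling costs from the table
V = 0.0882 ** 2            # required Var(y_st)

N = sum(N_i)
W = [m / N for m in N_i]
sum_WS2 = sum(W_i * s2 for W_i, s2 in zip(W, S2))

# Proportional allocation: n0 = (1/V) sum W_i S_i^2, n = n0 / (1 + n0/N).
n0 = sum_WS2 / V
n_prop = n0 / (1 + n0 / N)
alloc_prop = [ceil(n_prop * W_i) for W_i in W]

# Optimum allocation: n_i proportional to W_i S_i / sqrt(c_i).
a = [W_i * sqrt(s2) / sqrt(c_i) for W_i, s2, c_i in zip(W, S2, c)]
b = sum(W_i * sqrt(s2) * sqrt(c_i) for W_i, s2, c_i in zip(W, S2, c))
weights_opt = [a_i / sum(a) for a_i in a]
n_opt = b * sum(a) / (V + sum_WS2 / N)
alloc_opt = [ceil(n_opt * w_i) for w_i in weights_opt]

print(n_prop, alloc_prop, sum(alloc_prop))
print(weights_opt, n_opt, alloc_opt, sum(alloc_opt))
```

The rounded allocations come out as 20, 9, 5 (total 34) and 16, 11, 4
(total 31), in agreement with the figures quoted in the example.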

What we should note from these results is the impressive reduction
in the required sample size (from 62 to about 33) that arises
from exploiting the stratification of the population. We would expect
to obtain such an improvement from stratification in this situation,
in view of the stratum variances being small relative to the population
variance (which is of the order of 0.64), and in view of the
fact that the stratum means will inevitably vary widely (being
correlated with total turnover, our basis of stratification).

The reduction for optimum, compared with proportional, allocation
(from 34 to 31) is modest by comparison.

In examining the relative efficiency of $\bar{y}_{st}$ and $\bar{y}$ in
Section 5.2, we assumed proportional allocation for the stratum sample sizes.
It is interesting to see how the comparison differs for optimal (Neyman)
allocation.

Denoting $\mathrm{Var}(\bar{y})$ (for simple random sampling) by $V$ and using
the form (5.17) for $V_N$ we have

$$V - V_N > 0$$

if

$$\Bigl( \frac{1}{n} - \frac{1}{N} \Bigr) S^2 - \bar{S}^2 / n + \sum_{i=1}^{k} (W_i S_i^2) / N > 0.$$

Substituting for $S^2$ its form (5.1) in terms of stratum means and variances
and assuming that stratum sizes are large enough for (5.8) to hold we find
that $V - V_N > 0$ provided

$$\sum_{i=1}^{k} W_i (S_i - \bar{S})^2 / n + \Bigl( \frac{1}{n} - \frac{1}{N} \Bigr) \sum_{i=1}^{k} W_i (\bar{Y}_i - \bar{Y})^2 > 0.$$

Obviously, however, both terms are non-negative and we must obtain an
efficiency gain from stratified s.r. sampling with Neyman allocation except in
the limiting case where all stratum means are equal, and all stratum variances
are equal. The efficiency gain will thus be the greater, the larger is the
variability either in stratum means or in stratum variances.

It is useful to review the results of this section and of Section 5.2 to see
what we have discovered about the relative efficiencies of the s.r. sample mean
($\bar{y}$) and the stratified sample mean under proportional allocation
($\bar{y}_{st(P)}$) and Neyman allocation ($\bar{y}_{st(N)}$). The following
qualitative comparisons hold when stratum sizes are large enough for the
approximation (5.8) to be reasonable.

$\bar{y}_{st(P)}$ versus $\bar{y}$: $\bar{y}_{st(P)}$ is at least as efficient
as $\bar{y}$; the relative efficiency increases with increase in the
variability of the stratum means.

$\bar{y}_{st(N)}$ versus $\bar{y}$: $\bar{y}_{st(N)}$ is at least as efficient
as $\bar{y}$; the relative efficiency increases with the increase in the
variability either of the stratum means or of the stratum variances.

$\bar{y}_{st(N)}$ versus $\bar{y}_{st(P)}$: $\bar{y}_{st(N)}$ is at least as
efficient as $\bar{y}_{st(P)}$; the relative efficiency increases with increase
in the variability of the stratum variances.

There is a tidy logical development in these relative efficiency comparisons.
We have already seen, however (in Section 5.2), that the situation becomes
more complicated if (5.8) cannot be assumed. For the comparison of
$\bar{y}_{st(P)}$ and $\bar{y}$ we found that an efficiency gain required the
variability of the stratum means to exceed a weighted combination of the
stratum variances.

The comparison of $\bar{y}_{st(N)}$ and $\bar{y}_{st(P)}$ is unaffected by (5.8).

This leaves open the question of how crucial is (5.8) to the comparison of
$\bar{y}_{st(N)}$ and $\bar{y}$. No simply expressed, or readily interpretable,
intercomparison arises in this case and we shall not pursue it further,
especially since (5.8) does not represent a major constraint and it is likely
to be satisfied in most well-designed surveys.


5.5 Some practical considerations 


We have considered in some detail the effects, and possible advantages, of
stratification of the population in relation to the estimation of $\bar{Y}$
and $Y_T$. The discussion has been rather formal. It was assumed that the
values of certain population characteristics (such as stratum sizes, $N_i$,
and variances, $S_i^2$) are known. Little attention was given to real-life
considerations in the choice of strata, or to the practical problems of
determining stratum sample sizes when, as is likely, we do not have very
accurate knowledge of the $N_i$ or the $S_i^2$. In this section we shall
examine such matters a little more fully.


Unknown $N_i$ and $S_i^2$


The various results we have obtained for $\mathrm{Var}(\bar{y}_{st})$ under
different circumstances, and for the choice of stratum sample sizes, have been
expressed in terms of the $N_i$ and $S_i^2$. Very often in practice we will
have no precise knowledge of the values of these quantities. At best we can
estimate them from the survey data, or informally assign 'reasonable' values
on the basis of previous experience. Any conclusions must reflect this lack of
precise knowledge.

Even if we know the stratum sizes $N_i$, and adopt some prescribed allocation
of stratum sample sizes $n_i$, then uncertainty of the values of the $S_i^2$
will mean that we cannot accurately assess the variance of the estimator
$\bar{y}_{st}$ of $\bar{Y}$. The estimation of $\mathrm{Var}(\bar{y}_{st})$
from the sample data has been briefly discussed in Section 5.1.

If the $N_i$ are also unknown, greater difficulties arise. The stratum weights
$W_i = N_i/N$ are crucial ingredients in $\mathrm{Var}(\bar{y}_{st})$, and if we are to use a stratified
sample mean we must estimate them in some way. It is possible that published 
data may help. For example, nationally based returns such as the Census (for 
human population factors), or other large-scale government department 
surveys (of agricultural, industrial, medical or educational factors), will contain 
a great deal of breakdown of information for the country as a whole. If it is 
reasonable to believe that some ‘local’ population represents the larger national 
environment, then we can adopt the national stratum weights in the local 
enquiry. 

But such representativeness is far from inevitable; local populations are
notoriously idiosyncratic. The determination of stratum weights in this way
can be perfectly reasonable, but it can also be fraught with danger. The major
safeguard lies in the experience, and shrewdness, of the investigator. Previous 
successes and failures must guide any decision to carry over ‘global’ stratum 
weights to the local problem. 

Statistical methods are of little help, except on those rare occasions where 
the potential gains from stratification are large enough to justify the expense 
of conducting some fairly detailed pilot survey designed purely to estimate the 
stratum weights as a preliminary to the main stratified sample survey. This 
approach is a further example of double sampling, or two-phase sampling, and 
makes it possible to take formal account of the sampling properties of the 
estimators of the $W_i$ (implied by the method of drawing the pilot data) in
assessing the properties of $\bar{y}_{st}$.

Certain general remarks can be made concerning the likely effects of
imprecise knowledge of the $W_i$. It will usually happen that $\bar{y}_{st}$ will be biased;
its accuracy is best assessed by mean square error about $\bar{Y}$, rather than by
$\mathrm{Var}(\bar{y}_{st})$. Furthermore, the bias does not tend to reduce with increase in sample
size. Estimation of the mean square error (necessary because the stratum
variances will also be unknown) will introduce further imprecision.
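
The remark that the bias does not shrink with sample size can be illustrated numerically. The following is a minimal sketch, assuming that each stratum sample mean is unbiased for its stratum mean, so that using assumed weights $w_i$ in place of the true $W_i$ gives an estimator with expectation $\sum w_i \bar{Y}_i$ and hence bias $\sum (w_i - W_i) \bar{Y}_i$. The function name and all the figures are hypothetical and serve only as illustration.

```python
def weight_bias(assumed_weights, true_weights, stratum_means):
    """Bias of the stratified mean when assumed weights w_i replace the true W_i.

    The bias is sum_i (w_i - W_i) * Ybar_i. No sample size appears in it,
    so taking a larger sample does not reduce it.
    """
    return sum(
        (w - W) * m
        for w, W, m in zip(assumed_weights, true_weights, stratum_means)
    )


# Hypothetical two-stratum population (illustrative figures only).
true_W = [0.6, 0.4]      # true stratum weights W_i
assumed_w = [0.5, 0.5]   # weights carried over from some external source
Ybar_i = [31.0, 14.0]    # true stratum means

bias = weight_bias(assumed_w, true_W, Ybar_i)
print(f"bias = {bias:.2f}")
```

Because the sample sizes do not enter the expression, the mean square error cannot fall below the square of this bias however large the sample.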

Lack of knowledge of the $N_i$ and $S_i^2$ (particularly the latter) becomes even
more serious when choosing the allocation of stratum sample sizes. Such an
allocation must be determined before we choose the main sample, so that we
do not even have sample estimates of the $S_i^2$ to help us.

This is the same problem (but in more acute form) as the one considered 
in Section 2.6 concerning the choice of the size of an s.r. sample in an unstratified
population. Some of the suggestions made at the earlier stage may again prove 
fruitful: e.g. the use of provisional estimates of variance from pilot studies or 
from similar surveys carried out previously. Or again we may consider taking 
preliminary samples from each stratum to yield rough estimates of stratum 
variances which are in turn used to determine an appropriate allocation of
stratum sample sizes. The preliminary sample is then ‘topped up’ to the required
allocation. This is another illustration of double (or two-phase) sampling. 

Imprecision arises using any of these methods and will be reflected in, for 
example, what we might hope to be an optimum allocation being non-optimum 
in practice—perhaps seriously so. Cochran (1977, Chapter 5A) considers in 
more detail the effects of non-optimum allocation, also the effects of several 
other aspects of incomplete knowledge of stratum sizes and variances. See 
also Sampford (1962, Chapter 6) for further practical details. 


Over-sampling of strata 


When applying the formulae (5.13) or (5.18) to determine optimum stratum 
sample sizes, it is not impossible that some of the resulting $n_i$ may exceed the
corresponding stratum sizes $N_i$. This is particularly likely to happen if the
sampling fraction is large and if stratum variances differ widely. Obviously
we cannot take more observations than there are members in a stratum, and
the optimum allocation cannot be attained. The usual practice is to fully sample
any strata for which the optimum $n_i$ exceed $N_i$; that is, to take all members
of those strata. For these strata $\bar{Y}_i$ and $S_i^2$ will be determined exactly. The
variance of the resulting $\bar{y}_{st}$ cannot now be obtained from the results for
optimum allocation, and we must take care to use the appropriate form of the
general expression (5.2) (or its estimator (5.6)), recognising that certain
sampling fractions $f_i$ will be unity and hence that the corresponding terms in
$\mathrm{Var}(\bar{y}_{st})$ will not contribute.
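
The practice of fully sampling over-allocated strata can be sketched in code. The sketch below assumes a Neyman-type allocation with $n_i$ proportional to $N_i S_i$, which we take to be the form meant by (5.13). It also assumes that the sample size left over after a stratum is fully sampled is shared among the remaining strata by the same rule; that reallocation step is a common convention and is not prescribed in the text. The function name and the figures are hypothetical, and the total $n$ is assumed not to exceed the population size.

```python
def neyman_allocation_capped(n, stratum_sizes, stratum_sds):
    """Allocate n in proportion to N_i * S_i, with no n_i allowed to exceed N_i.

    Strata whose trial allocation exceeds N_i are fully sampled; the remaining
    sample size is then reallocated among the other strata (an assumed rule).
    Results are not rounded to integers.
    """
    k = len(stratum_sizes)
    alloc = [0.0] * k
    free = set(range(k))      # strata not yet fixed at n_i = N_i
    remaining = float(n)      # sample size still to be shared out
    while free:
        total = sum(stratum_sizes[i] * stratum_sds[i] for i in free)
        if total == 0:
            break             # no variability left to allocate against
        trial = {
            i: remaining * stratum_sizes[i] * stratum_sds[i] / total
            for i in free
        }
        over = [i for i in free if trial[i] > stratum_sizes[i]]
        if not over:
            for i in free:
                alloc[i] = trial[i]
            break
        for i in over:        # take every member of an over-allocated stratum
            alloc[i] = float(stratum_sizes[i])
            remaining -= stratum_sizes[i]
            free.remove(i)
    return alloc


# Hypothetical strata: the third is small but highly variable.
N_i = [100, 50, 10]
S_i = [1.0, 2.0, 30.0]
allocation = neyman_allocation_capped(60, N_i, S_i)
print(allocation)
```

In this illustration the unconstrained rule would ask for 36 observations from a stratum of only 10 members, so that stratum is taken in full and the other 50 observations are shared between the first two strata.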


Sub-populations; multi-way stratification; several variables 


These represent just a few of the possible extensions of study of stratified 
populations, and only brief comments will be made. 

In a sample survey we often wish to estimate characteristics of sub-popula- 
tions (or ‘domains of study’) as well as characteristics of the whole population. 
For example, in a survey of apple prices from different suppliers from a range 
of geographic regions we might be interested in average prices for each variety 
of apple over the different regions, or for each region over the different varieties 
of apple. 

A stratified random sample will reflect both aspects but with different
properties depending on the criterion of stratification: region or variety of apple.

(i) Suppose the sub-populations are the same as the strata. The samples in 
each stratum are just simple random samples of predetermined size. So 
we need only use the results of simple random sampling to estimate 
sub-population characteristics directly and to reflect the properties of 
the estimates. For example, if we stratify by variety of apple we can 
readily estimate means for the different varieties and assess the proper- 
ties of the estimates. We recall that stratification is most advantageous 
for estimating overall population characteristics when between-stratum 
means differ widely and within-stratum variances are as small as poss- 
ible. In this sense stratification by variety of apple might be better than 
stratification by region (although it may not be as convenient administra- 
tively): compare prices for local apples and imported ones. 

(ii) Suppose sub-populations do not coincide with the strata. We might, for 
example, still be interested in means for different varieties of apple but 
our stratification is by region. Thus the sample might be made up of 
simple random samples from each region with the price of just one 
variety of apple (possibly chosen at random) from each sampled sup- 
plier. The sub-population sample for a variety of apple now consists of 
some observations from each stratum (region). It is no longer a simple 
random sample: the sample size is now a random quantity. Appropriate 
estimators, and their properties, are now more complicated, and would
require special investigation.

In the example just discussed, we have assumed that we have stratified by
one factor (here region) and have taken random samples from each stratum 
(region) over the other factor (here variety of apple). 

But this is hardly realistic or economical. 

If we stratify both by region and by variety of apple we could, of course, use 
the properties of simple random sampling for either interest, i.e. for regional 
means or for apple-variety means. 

Alternatively, we might have stratified by region, chosen suppliers at 
random and recorded simultaneously in each case the prices X, Y,... of 
each of the varieties of apple on each sampling unit. Our characteristic 
variable is now multivariate, and yet further difficulties arise. One problem is 
that what constitutes an optimum (or even a reasonable) allocation of
stratum sample sizes for one variety of apple may not do so for another. The first
stage must be to determine and compare appropriate allocations for the 
different varieties separately. If the varieties do not differ widely in terms of 
variations in means, or in intra-regional variances, from one region to 
another a compromise can be adopted which is not far from the 
appropriate (optimum or, perhaps, proportional) allocation for each. If they 
do differ widely (and for a wide selection of varieties this could be so) there 
can be no sensible compromise, and additional criteria must be introduced. 
These will usually be based on an assessment of the relative importance 
of the different variables. This might be a purely subjective procedure or 
occasionally a more formal decision theory type of analysis might be con- 
ducted. 

Let us return to the notion of multi-factor stratification. In large scale surveys 
there will often be great appeal in stratifying the population with respect to 
several factors simultaneously. In a sociological enquiry it might appear 
desirable to divide the population into different sexes, different employment 
groups, different nationalities, different types of accommodation, and so on. 
Several practical interests support this: a desire to make the sample ‘reflect’ 
the population as a whole, or to facilitate the study of highly specialised 
subgroups (possibly different groups are of particular relevance for different 
variables being simultaneously recorded). 

The strata become correspondingly specialised: female, highly paid, Welsh, 
lathe operators living in high-rise flats, and so on! For both the estimation of 
overall population characteristics, and the characteristics of subgroups, there 
might be some disadvantages in such multi-way stratification. 

The individual factors of stratification (sex, income group, etc.) will have 
been chosen for practical interest. They may not necessarily correspond to 
desirable bases of stratification in the statistical sense of leading to improved 
efficiency of estimation of overall population characteristics. Simultaneous 
stratification by several factors soon leads to a vast number of strata: just 3
factors each at 4 levels yields $4^3 = 64$ strata. Except in a very large survey,
individual stratum sample sizes are bound to be very small. The precision of 
estimators of stratum means (totals, etc.) will be correspondingly low.
Administrative difficulties are likely to arise in the choice of random samples
for the specialised subgroups. These subgroup (stratum) sizes and variances 
are unlikely to be known (or to be estimable) with any precision, so that the 
appropriate allocation of stratum sample sizes, or the estimation of the variance 
of estimators, will be difficult and unreliable. 

One particular problem is that there may be so many strata that it is not 
economical to ensure that they are all represented even with a single
observation. Such a problem is discussed and illustrated by Cochran (1977, Chapter
5A), who once again provides a more detailed development of the various
topics introduced throughout this sub-section. 


Post-hoc stratification 


Suppose plans have been drawn up to conduct a sample survey on a stratified 
population, and that stratum sizes and stratum variances are known. We can 
use the results derived in the earlier sections of this chapter to determine the 
variance of the stratified mean $\bar{y}_{st}$ for any allocation of stratum sample sizes;
proportional allocation might be a typical choice. Alternatively we could
calculate the optimum allocation that should be used to minimise $\mathrm{Var}(\bar{y}_{st})$ or
total cost. 

However, the success of our efforts will rest on our ability to actually obtain 
the appropriate sample in the practical situation. We considered earlier some 
of the practical difficulties of drawing a truly random sample. In stratified 
sampling, we need such a sample from each stratum, and a major complication
can arise in that we may not be able to determine to which stratum
an observation belongs until it has been drawn.

This can happen, for example, where strata correspond to different personal 
details on people—such as their religious beliefs, income levels, nationality, 
educational achievement, and so on. For such factors, published national 
reports may provide a clear indication of stratum weights (sizes) and variances; 
but it can be most difficult to sample individuals from specific strata. We will 
consider shortly the quota sampling method which attempts to overcome 
such difficulties. Sometimes, however, we may have no alternative but to 
draw our sample and stratify it subsequently: that is, carry out a post-hoc 
stratification. 

In this situation there can be no prospect of drawing a stratified random 
sample, since we cannot draw s.r. samples from specific strata. Instead we 
might take an s.r. sample of size n from the whole population, and subsequently 
assign individuals to the different strata. Although we do not obtain a stratified 
random sample, we should expect to encounter numbers in the different strata 
roughly in proportion to the stratum sizes, $N_i$. The resulting post-stratified
sample should be somewhat similar to that which would have been obtained
by stratified random sampling with proportional allocation, provided that the
numbers of individuals, $n_i'$, falling in each stratum, $i$, are reasonably large. If
we were now to estimate $\bar{Y}$ by the quantity analogous to $\bar{y}_{st}$ (rather than by
$\bar{y}$), that is by

    $$\bar{\bar{y}} = \sum_{i=1}^{k} W_i \bar{y}_i',$$

where $\bar{y}_i'$ is the mean of the observations that are in stratum $i$, we might expect
to recover some of the potential advantages of $\bar{y}_{st}$ itself.

This proves to be so. As long as the $S_i^2$ do not differ widely, and the $n_i'$ are
large, the post-stratified estimator $\bar{\bar{y}}$ behaves similarly to $\bar{y}_{st}$ obtained from
proportional allocation. Its variance is thus approximately $(1-f) \sum W_i S_i^2 / n$,
which (as we have seen) can be considerably less than $\mathrm{Var}(\bar{y}) = (1-f) S^2 / n$.
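
To see how large the difference between these two variances can be, the following minimal sketch evaluates both expressions. The function names and every numerical value (population size, weights, stratum variances and overall variance) are hypothetical, chosen only to represent a population whose stratum means differ widely.

```python
def var_srs_mean(n, N, S2):
    """Variance of the s.r. sample mean: (1 - f) * S^2 / n, with f = n / N."""
    return (1 - n / N) * S2 / n


def approx_var_post_stratified(n, N, weights, stratum_variances):
    """Approximate variance of the post-stratified mean for large samples:
    (1 - f) * sum_i W_i * S_i^2 / n, the proportional-allocation form."""
    f = n / N
    within = sum(W * s2 for W, s2 in zip(weights, stratum_variances))
    return (1 - f) * within / n


# Hypothetical population: two strata with small internal variances but
# widely separated means, so the overall variance S^2 is much larger.
N, n = 1000, 100
W_i = [0.6, 0.4]
S2_i = [4.0, 9.0]
S2_overall = 75.0

v_srs = var_srs_mean(n, N, S2_overall)
v_post = approx_var_post_stratified(n, N, W_i, S2_i)
print(f"s.r. sample mean: {v_srs:.4f}   post-stratified mean: {v_post:.4f}")
```

The gain comes entirely from the between-stratum component of $S^2$, which enters the first expression but not the second.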

Another possible use of post-hoc stratification is to correct ‘obvious lack
of representativeness’ in an s.r. sample, as illustrated in the following
example.


Example 5.3 


Suppose we draw a random sample of 10 individuals from the 
Statistics Class data, and wish to estimate the population mean
height $\bar{Y}$. The particular sample drawn happens to contain 7 women
and 3 men; the sample mean is

    $$\bar{y} = 19.1.$$

But we know that 60% of the population are men; our sample
contains only 30% men and $\bar{y}$ is surely likely to under-estimate $\bar{Y}$
(which is 23.9). If instead of using $\bar{y}$, we work out the sample means
for the men and women separately, we obtain, respectively,

    31.0 and 14.0.

The weighted estimator $\bar{\bar{y}}$ for such post-hoc stratification yields

    $$\bar{\bar{y}} = 0.6 \times 31.0 + 0.4 \times 14.0 = 24.2,$$

which is much closer to the true value (23.9).

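
The arithmetic of Example 5.3 can be checked with a few lines of code. This is a minimal sketch using only the figures quoted in the example (3 men and 7 women in the sample, stratum sample means 31.0 and 14.0, population weights 0.6 and 0.4); the function name is our own.

```python
def post_stratified_mean(weights, stratum_sample_means):
    """Weighted estimator: sum over strata of W_i times the stratum sample mean."""
    return sum(W * m for W, m in zip(weights, stratum_sample_means))


# Figures from Example 5.3, listed in the order (men, women).
sample_counts = [3, 7]
sample_means = [31.0, 14.0]
population_weights = [0.6, 0.4]

# Ordinary sample mean: weights each stratum by its share of the sample.
pooled_mean = sum(c * m for c, m in zip(sample_counts, sample_means)) / sum(sample_counts)

# Post-stratified mean: weights each stratum by its share of the population.
weighted_mean = post_stratified_mean(population_weights, sample_means)

print(f"sample mean = {pooled_mean:.1f}, post-stratified mean = {weighted_mean:.1f}")
```

The two estimators differ only in the weights applied to the stratum sample means: 0.3 and 0.7 for the ordinary sample mean, 0.6 and 0.4 for the post-stratified one.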
Clearly, little can be claimed (other than vague intuitive appeal) for use of
such a procedure in small samples, or without some prescription of what
degree of ‘lack of representativeness’ will be needed to prompt the use of
the post-stratified estimator $\bar{\bar{y}}$.
But if the sample size is large enough to be confident that stratum sample sizes
will also be large, and $\bar{\bar{y}}$ is always used, whatever the constitution of the
sample, we return to the earlier situation where $\bar{\bar{y}}$ has essentially similar
properties to $\bar{y}_{st}$ obtained by proportional allocation.


5.6 Quota sampling 


Closely allied to stratified random sampling is the method of Quota Sampling,
which is widely used in market research, opinion surveys, and a variety of 
nationwide enquiries. 

Its importance lies in the fact that it is the principal method of sampling employed 
by commercial data-collection organisations which service business needs for 
survey information, as well as being commonly used by individual agencies 
requiring regular re-appraisal of attitudes or activities in society. Political 
views, reactions to new or proposed government policy, patterns of trade and 
industry, consumer attitudes to products, and television audience sizes are all 
likely to be assessed through surveys based on quota sampling principles. The 
ubiquitous ‘opinion poll’ provides full scope for the method. 

In essence, quota sampling is merely stratified random sampling with a complex 
multifactor stratification and with stratum sample sizes chosen by proportional 
allocation. The strata are chosen principally to ensure a ‘representative picture’ 
of the population with respect to the factors of stratification, and to yield 
estimates in specific subgroups, rather than in a desire to enhance efficiency 
of estimation in a statistical sense. But this latter ‘spin off’ can arise if the 
strata happen to have appropriate form (wide discrepancy of mean values, 
low internal scatter). The practical interest may often produce this effect— 
consider, for example, stratification by age, employment group, geographic 
region, etc., in different situations. 

Where quota sampling differs from stratified random sampling is in the fact
that the stratum samples may not be random; an element of subjective choice
enters into the sampling practice because of the manner in which it is conducted.

Typically, strata are defined, and the stratum sample sizes needed for 
proportional allocation are then calculated from (more or less) known overall 
stratum sizes in the population. The data are then collected by instructing 
interviewers or interrogators to fill the quotas for the different strata, by street
interviews, house to house enquiries, postal questionnaires, and so on. 
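
The calculation of the quotas themselves is simple proportional allocation. A minimal sketch follows; the rounding of the exact allocations to whole numbers by the largest-remainder rule is our own assumption (the text does not specify any rounding rule), and the function name and stratum sizes are hypothetical.

```python
import math


def proportional_quotas(n, stratum_sizes):
    """Whole-number quotas with n_i as close as possible to n * N_i / N.

    Exact allocations are rounded down, and any shortfall is made up by giving
    one extra unit to the strata with the largest fractional parts
    (largest-remainder rounding, an assumed convention).
    """
    N = sum(stratum_sizes)
    exact = [n * Ni / N for Ni in stratum_sizes]
    quotas = [math.floor(x) for x in exact]
    shortfall = n - sum(quotas)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - quotas[i], reverse=True
    )
    for i in by_remainder[:shortfall]:
        quotas[i] += 1
    return quotas


# Hypothetical stratum sizes taken from published figures.
quotas = proportional_quotas(100, [5200, 3100, 1700])
uneven = proportional_quotas(10, [1, 1, 1])
print(quotas, uneven)
```

With a multi-factor stratification the same function would be applied to the sizes of the cross-classified cells, each cell then giving one quota for the interviewers to fill.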

The inevitable effect is that we cannot be sure that the selection of respon- 
dents is at random within the strata. To fill the quota by arbitrary selection 
would be time consuming. Successively more and more observations would 
be rejected as time goes on and as quotas fill up, and the practice is adopted 
of allowing the interviewer to ‘use his or her judgment’ to fill the quotas. 

Thus as time goes on more and more personal choice is exercised in picking 
respondents—even style of dress can have an influence. To fill a quota of ‘over 
50 year old, professional class men’, the local doctor may well be deliberately 
omitted from a street interview if he is visiting the shops to replenish paint 
stocks, in the midst of painting the attic. This element of ‘determined choice’ 
means that we cannot be confident in applying the results above in the quota
sampling context. In particular, non-response, which is inevitably ignored in
quota sampling, can seriously bias results; consider (for example) television
or radio audience assessments, where non-response might be highly correlated 
with viewing or listening behaviour. 

This is not to say that quota sampling cannot produce very good results. It 
can, and often does. The difficulty is that we have no proper basis for measuring 
the properties of the sampling scheme since it is not truly probability based. 
(See Section 1.5 and Section 3.3.) Stephen and McCarthy (1958) give an early 
detailed appraisal of quota sampling methods and practice. 


5.7 Estimating proportions 


The earlier development of the properties of stratified sample estimators was 
restricted to the estimation of the population mean, $\bar{Y}$, by the stratified sample
mean, $\bar{y}_{st}$. The results extend in an obvious way to the estimation of the
population total, $Y_T$. Two important further matters are the estimation of a
population proportion, and the use of an auxiliary associated variable to
improve the efficiency of estimation of $\bar{Y}$ or $Y_T$. A few results on these two
topics will be derived in this section and the following one, respectively. 

Suppose that each member of a finite population can be assigned to one or 
other of two categories, and that P is the proportion of the population in the 
first category. It is of interest to examine how P might be estimated from a 
stratified random sample, and what the properties of the estimator might be. 
In a sense, the results derived for $\bar{y}_{st}$ answer this question: $P$ can be regarded
as the population mean value of a variable $X$, which is zero if population
members fall into the second category, one if in the first category. Thus $\bar{X} = P$,
and the corresponding stratified sample mean $\bar{x}_{st}$ seems a sensible choice of
estimator for $P$.

We have

    $$\bar{x}_{st} = \sum_{i=1}^{k} W_i \bar{x}_i,$$

with $\bar{x}_i$ being just the proportion $p_i$ of the $i$th stratum sample members in the
first category of classification. So, we can estimate $P$ by

    $$p_{st} = \sum_{i=1}^{k} \frac{N_i}{N} p_i = \sum_{i=1}^{k} W_i p_i,$$

and such a weighted average has an obvious intuitive appeal.
Recalling the discussion of Section 2.10 which considered the estimation of
$P$ from an s.r. sample, the only essential distinction between the properties of
$p_{st}$ and $\bar{y}_{st}$ will arise from the fact that the stratum variances $S_i^2$ must now
depend on the quantities $P_i$, the true stratum proportions. We have

    $$S_i^2 = \frac{N_i}{N_i - 1} P_i (1 - P_i).$$

Clearly

    $p_{st}$ is unbiased for $P$

and, from (5.2),

    $$\mathrm{Var}(p_{st}) = \sum_{i=1}^{k} \frac{W_i^2}{n_i} \left( \frac{N_i - n_i}{N_i - 1} \right) P_i (1 - P_i). \qquad (5.20)$$


If the $N_i$ are at all large, then the factors $(N_i - n_i)/(N_i - 1)$ can of course be
replaced by $(1 - f_i)$.
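
As a numerical illustration of the estimator and of (5.20), the following minimal sketch computes $p_{st} = \sum W_i p_i$ and the variance $\sum (W_i^2/n_i)\{(N_i - n_i)/(N_i - 1)\} P_i (1 - P_i)$. The function names and all figures are hypothetical, and the true stratum proportions $P_i$ are treated as known purely for the purpose of the illustration.

```python
def stratified_proportion(stratum_sizes, sample_proportions):
    """p_st: the stratum sample proportions p_i weighted by W_i = N_i / N."""
    N = sum(stratum_sizes)
    return sum(Ni / N * pi for Ni, pi in zip(stratum_sizes, sample_proportions))


def var_stratified_proportion(stratum_sizes, sample_sizes, true_proportions):
    """Variance of p_st as in (5.20), using the true stratum proportions P_i."""
    N = sum(stratum_sizes)
    total = 0.0
    for Ni, ni, Pi in zip(stratum_sizes, sample_sizes, true_proportions):
        Wi = Ni / N
        total += Wi ** 2 / ni * (Ni - ni) / (Ni - 1) * Pi * (1 - Pi)
    return total


# Hypothetical two-stratum survey.
N_i = [600, 400]
n_i = [30, 20]
p_i = [0.5, 0.2]   # observed stratum sample proportions
P_i = [0.5, 0.2]   # true stratum proportions, assumed known here

estimate = stratified_proportion(N_i, p_i)
variance = var_stratified_proportion(N_i, n_i, P_i)
census_variance = var_stratified_proportion(N_i, N_i, P_i)
print(f"p_st = {estimate:.3f}, Var(p_st) = {variance:.6f}")
```

Setting every $n_i$ equal to $N_i$ makes each factor $(N_i - n_i)$ vanish, so the variance is zero for a complete enumeration, as it should be.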
Special forms of this result for different allocations $(n_1, \ldots, n_k)$, and
optimum choice of allocation, all follow from the results in Sections 5.1-5.4.
In particular, with proportional allocation, we have

    $$\mathrm{Var}(p_{st}) = \frac{N - n}{n} \sum_{i=1}^{k} \frac{W_i^2}{N_i - 1} P_i (1 - P_i) \approx \frac{1 - f}{n} \sum_{i=1}^{k} W_i P_i (1 - P_i)$$

(replacing $(N_i - 1)$ in the denominators by $N_i$).

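Adopting this latter approximation, the optimum allocations for fixed sample
size ignoring costs (Neyman allocation) and for fixed cost, $C = c_0 + \sum_{i=1}^{k} c_i n_i$,
are

    $$n_i = n \, \frac{W_i \sqrt{P_i (1 - P_i)}}{\sum_{j=1}^{k} W_j \sqrt{P_j (1 - P_j)}}$$

and

    $$n_i = (C - c_0) \, \frac{W_i \sqrt{P_i (1 - P_i) / c_i}}{\sum_{j=1}^{k} W_j \sqrt{P_j (1 - P_j) c_j}}$$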

respectively. (See Section 5.3.) The potential advantages of such optimum 
allocations over arbitrary or proportional allocation can be assessed from the 
appropriate forms of the results in Section 5.4. 
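
The two optimum allocations for proportions can be sketched as follows, assuming the large-$N_i$ approximation $S_i^2 \approx P_i(1 - P_i)$, so that $n_i$ is proportional to $W_i \sqrt{P_i(1 - P_i)}$ for fixed sample size and to $W_i \sqrt{P_i(1 - P_i)/c_i}$ for fixed cost $C = c_0 + \sum c_i n_i$. The function names, the provisional values of the $P_i$ and the costs are all hypothetical, and the results are left unrounded.

```python
import math


def neyman_allocation_proportions(n, weights, P):
    """Fixed total sample size n: n_i proportional to W_i * sqrt(P_i * (1 - P_i))."""
    terms = [W * math.sqrt(Pi * (1 - Pi)) for W, Pi in zip(weights, P)]
    total = sum(terms)
    return [n * t / total for t in terms]


def fixed_cost_allocation_proportions(C, c0, weights, P, unit_costs):
    """Fixed total cost C = c0 + sum_i c_i * n_i:
    n_i = (C - c0) * W_i * sqrt(P_i(1-P_i)/c_i) / sum_j W_j * sqrt(P_j(1-P_j) * c_j)."""
    numerators = [
        W * math.sqrt(Pi * (1 - Pi) / c)
        for W, Pi, c in zip(weights, P, unit_costs)
    ]
    denominator = sum(
        W * math.sqrt(Pi * (1 - Pi) * c)
        for W, Pi, c in zip(weights, P, unit_costs)
    )
    return [(C - c0) * x / denominator for x in numerators]


# Hypothetical provisional values for two equally sized strata.
W_i = [0.5, 0.5]
P_i = [0.5, 0.1]
c_i = [1.0, 4.0]   # cost per observation in each stratum

neyman = neyman_allocation_proportions(80, W_i, P_i)
by_cost = fixed_cost_allocation_proportions(200.0, 20.0, W_i, P_i, c_i)
spent = 20.0 + sum(c * ni for c, ni in zip(c_i, by_cost))
print(neyman, by_cost, spent)
```

Strata with $P_i$ near one half receive the heaviest sampling, since $P_i(1 - P_i)$ is largest there; the final line confirms that the fixed-cost allocation spends exactly the budget $C$.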


In practice we will not know the $P_i$, and must use appropriate sample
estimates. For example, we will estimate $S_i^2$ by the unbiased estimator
$n_i p_i (1 - p_i)/(n_i - 1)$. (See Section 2.10.)


5.8 Ratio estimators in stratified populations


In Chapter 4 we considered some of the ways in which the existence of an 
auxiliary variable X, ‘correlated’ with the variable Y of principal interest, can 
be exploited to provide better estimators of characteristics of the population 
of Y values than are obtained from an s.r. sample of Y values alone. The 
ratio estimator, or regression estimator, provided this facility. 

One or the other was the more appropriate choice depending on the nature 
of the apparent relationship between the Y and X values (and depending on 
the extent of their association). If these estimators are to be recommended in 
simple random sampling from unstratified populations, there is reason to 
believe that the same may be true in stratified populations. To illustrate this 
we shall consider just one of the ways in which an associated auxiliary variable 
X can be employed in estimating characteristics of the population of Y values. 
The example we shall take is that of a ratio estimator of Y in a stratified 
population. 

Suppose we draw a stratified random sample of values of $(Y, X)$, with
$n_1, \ldots, n_k$ observations in the different strata, the sample stratum means being
$\bar{y}_i$, $\bar{x}_i$ $(i = 1, \ldots, k)$. Suppose that in looking at the scatter diagrams of the data
for each stratum there appears to be a fair degree of proportionality between
the values of the two variables, shown in the form of a roughly linear
relationship through the origin without substantial scatter. The slope need not appear
to be identical in each stratum. Such an indication suggests that the use of ratio
estimators of stratum means or totals is likely to be profitable. (See Section 4.2.) We must
assume, of course, that the stratum mean values $\bar{X}_i$, for the auxiliary variable,
are known.

There are various ways in which we can combine such ratio estimators for 
the different strata to yield an estimator for the whole population. Two 
possibilities are to use the separate ratio estimator, or the combined ratio 
estimator. 

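Consider estimating $\bar{Y}$. The ratio estimator of $\bar{Y}_i$ is $(\bar{y}_i/\bar{x}_i)\bar{X}_i$, where $\bar{y}_i$, $\bar{x}_i$
are the stratum sample means.

Then the separate ratio estimator of $\bar{Y}$ is

    $$\bar{y}_{sst} = \sum_{i=1}^{k} W_i \frac{\bar{y}_i}{\bar{x}_i} \bar{X}_i,$$

i.e. the weighted average of the separate ratio estimators.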

The combined ratio estimator reverses this process, forming first the stratified
sample means and then correcting for the relationship between the two
variables. It has the form

    $$\bar{y}_{cst} = \frac{\bar{y}_{st}}{\bar{x}_{st}} \bar{X}.$$


The corresponding estimators of the total $Y_T$ will be

    $$\sum_{i=1}^{k} \frac{\bar{y}_i}{\bar{x}_i} X_{iT}$$

($X_{iT}$ is the total of the $X$ values in the $i$th stratum), and

    $$\frac{\bar{y}_{st}}{\bar{x}_{st}} X_T,$$

respectively.

What of the merits of these estimators? Both will tend to be biased unless
the sample size is reasonably large. The bias will be less serious in $\bar{y}_{cst}$ than
in $\bar{y}_{sst}$, since the sample size condition applies to the total sample rather than
to each stratum sample. Approximate variances of $\bar{y}_{sst}$ and $\bar{y}_{cst}$ can be obtained
by combining the results for the stratified mean and ratio estimator (typically
(5.2) and (4.14)). It turns out that, unless the relationship between $Y$ and $X$
is the same in all strata, the separate estimator will be more ‘efficient’ than
the combined estimator. But this must be offset by the lower tendency to bias
in $\bar{y}_{cst}$, and the fact that for this estimator we do not need to know the separate
stratum means $\bar{X}_i$, only the overall mean $\bar{X}$.
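
The difference between the two estimators is easily seen in a small calculation. The sketch below computes the separate estimator, $\sum W_i (\bar{y}_i/\bar{x}_i)\bar{X}_i$, and the combined estimator, $(\bar{y}_{st}/\bar{x}_{st})\bar{X}$, from stratum summaries. The function names and all the figures are hypothetical.

```python
def separate_ratio_estimate(weights, ybar, xbar, Xbar_strata):
    """Separate estimator: a ratio estimate within each stratum, then a
    weighted average. Requires every stratum mean of X to be known."""
    return sum(
        W * (y / x) * X
        for W, y, x, X in zip(weights, ybar, xbar, Xbar_strata)
    )


def combined_ratio_estimate(weights, ybar, xbar, Xbar_overall):
    """Combined estimator: stratified sample means of Y and X first, then a
    single ratio adjustment. Requires only the overall mean of X."""
    y_st = sum(W * y for W, y in zip(weights, ybar))
    x_st = sum(W * x for W, x in zip(weights, xbar))
    return (y_st / x_st) * Xbar_overall


# Hypothetical summaries; the ratio y/x differs between the strata (2 and 3).
W_i = [0.5, 0.5]
ybar_i = [10.0, 30.0]
xbar_i = [5.0, 10.0]
Xbar_i = [6.0, 9.0]
Xbar = sum(W * X for W, X in zip(W_i, Xbar_i))   # overall mean of X

separate = separate_ratio_estimate(W_i, ybar_i, xbar_i, Xbar_i)
combined = combined_ratio_estimate(W_i, ybar_i, xbar_i, Xbar)
print(f"separate = {separate:.2f}, combined = {combined:.2f}")
```

The two values differ here because the sample ratios $\bar{y}_i/\bar{x}_i$ are not the same in the two strata; were they equal in every stratum, the estimators would coincide.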

The combined effect of stratification and the use of ratio estimators is 
somewhat unpredictable. It is appealing to think that potential gains from 
stratification should be further enhanced by using ratio estimators. 

We can sometimes do much better; particularly when the auxiliary variable 
does not serve as the basis for stratification. But it is not as simple as this: the 
two effects are often tied up. Stratification can have the effect of reducing 
(even annihilating) the potential advantage of the relationship between Y and 
X, by weakening the relationship within the strata. The effects are by no means 
additive and the best we can say is that by combining the two techniques we 
should do as well as the better of the two on its own, for the problem in hand. 
Coupled with the extra effort and knowledge necessary if $\bar{y}_{sst}$ or $\bar{y}_{cst}$ are to be
employed, this makes their use problematical.

Needless to say, all the separate practical and formal difficulties of stratified 
sampling, and ratio estimation, will enter into their use in combination (includ- 
ing problems of estimating variances, etc.). In terms of survey design the form 
of optimum allocation will tend to be somewhat different to that described in 
Section 5.3 when the ratio method is used. See Cochran (1977, Section 6.14). 

Analogous methods exist for using regression estimators in stratified popula- 
tions, where these are more appropriate than ratio estimators. 


Example 5.4 


The separate ratio estimator $\bar{y}_{sst}$ (of mean height $\bar{Y}$, using weight $x$
as a concomitant variable) has been calculated for 500 stratified
random samples of size 5 from the Statistics Class data, with
proportional allocation, and rows as strata. The histogram of values
so obtained is shown in Figure 5.4; the mean and variance of the
500 values of $\bar{y}_{sst}$ are 22.4 and 6.58, respectively. In Example 5.1
we saw the dramatic effect of stratification by rows for estimating
$\bar{Y}$. Taking account also of the relationship between heights and
weights in the population, by using $\bar{y}_{sst}$, turns out to be far less
successful. The variance of the estimator is increased, and there is
noticeable bias ($\bar{Y} = 23.9$).

[Fig. 5.4. Histogram of 500 values of $\bar{y}_{sst}$ (rows) for samples of size 5 in the Statistics Class; horizontal axis: $\bar{y}_{sst}$ (rows), vertical axis: frequency.]


5.9 Conclusions 


This chapter has covered a large amount of material to do with sampling and 
estimation in stratified populations. To set the results in perspective, we 
conclude the chapter with a review of some of their implications. This is 
conveniently achieved by posing, and briefly answering, a few questions. 


Why use stratified populations? 


(i) In the hope of obtaining more efficient estimators than would be 
possible without stratification. 

(ii) For administrative convenience; practical constraints of access or cost 
may compel different sampling techniques to be used for different 
sections of the population. The resulting data arise as random samples 
from the different sections; these sections constitute the strata in what 
is a stratified random sample. 

(iii) Because we are interested in the sub-populations (strata) in their own
right; or wish to ‘represent’ such sub-populations ‘fairly’.

(iv) To reduce fortuitous bias in an unstratified sample, by post-hoc 
stratification—but this has dubious utility. 

Individually, or in combination, these factors support the use of stratified
sampling, either by deliberate construction of appropriate strata (as in (i) and
(iv)) or in an inevitable form determined by practical constraints and interests
(as in (ii) and (iii)).


What are the advantages?


As implied by (i), (ii), and (iii), respectively, we would hope to obtain increased 
efficiency of estimation of population characteristics under appropriate circum- 
stances, additional sampling convenience or greater ease of access to sub- 
populations of special interest. 


When does stratification lead to improved efficiency? 


Population characteristics can be more efficiently estimated from a stratified 
sample than from an overall s.r. sample if strata means differ widely, and 
within-strata variation is low. The greater this effect the greater the efficiency 
of the corresponding estimators. With freedom of choice of strata, the aim 
should be to construct strata with these characteristics. If stratification is largely 
for administrative convenience the choice is limited and the efficiency improve- 
ment uncertain (although practical constraints do often produce a subdivision 
of the population appropriate to improvement of efficiency). 


How should the population be stratified? 


If unhampered by practical constraints, then clearly the aim should be to 
divide the population into non-overlapping groups of Y-values to maximise 
the separation, and internal homogeneity, of the strata. So the proper basis 
for stratification to achieve maximum efficiency is the set of Y-values itself. 
In practice, however, we will not have sufficient knowledge of the population 
to stratify it in this way. Instead we must employ some more tangible external 
criterion. If, as often happens, such a criterion corresponds reasonably well 
with separation of the Y-values into non-overlapping groups, little potential 
advantage from stratification will be lost. 

For example, stratification by sex should prove a good criterion when 
estimating measures of physical stature, stratification by geographic region, 
or occupational category, likewise in estimating socio-economic factors. 
Although inevitably tempered by sampling ease and cost and the knowledge 
of stratum sizes and variances, we should strive to stratify the population in 
a way that is likely to produce the sort of stratification required from the
theoretical standpoint.


How should the sample sizes be allocated to different strata? 


Proportional allocation is particularly straightforward, and will often extract 
most of the potential advantages of stratified sampling. It is commonly used. 
If reliable information on stratum variances and sampling costs is available, 
then optimum allocation (or, for constant unit sampling costs, Neyman
allocation) is to be recommended.
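
The two allocation rules can be sketched in a few lines of code. This is only an illustration under stated assumptions: the stratum sizes and variances below are hypothetical (they are not taken from any example in the text), the function names are our own, and unit sampling costs are taken as equal, so that optimum allocation reduces to Neyman allocation with stratum sample sizes proportional to $N_i S_i$.

```python
from math import sqrt

def proportional_allocation(n, sizes):
    """Stratum sample sizes proportional to the stratum sizes N_i."""
    total = sum(sizes)
    return [n * N_i / total for N_i in sizes]

def neyman_allocation(n, sizes, variances):
    """Stratum sample sizes proportional to N_i * S_i (equal unit costs)."""
    weights = [N_i * sqrt(v) for N_i, v in zip(sizes, variances)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata (illustrative values only).
sizes = [500, 300, 200]
variances = [4.0, 9.0, 25.0]

prop = proportional_allocation(100, sizes)
ney = neyman_allocation(100, sizes, variances)
print([round(x, 1) for x in prop])
print([round(x, 1) for x in ney])
```

The values returned are not integers in general; in practice they would be rounded, with a check that the rounded sizes still sum to the intended total and do not exceed the stratum sizes.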
5.10 Exercises

5.1 A survey is to be conducted to estimate the total number of books 
borrowed from the 217 public libraries in a county authority during a particular 
week. It is possible to classify the libraries as small, medium, and large in size, 
on the basis of their stocks of books. The numbers of books borrowed from 
libraries in the three groups are thought to be roughly in the proportions 
1:2:3. It is further anticipated that the within-group variances of numbers of 
books borrowed will be proportional to the square root of the corresponding 
means. There are 71 small, 126 medium, and 20 large libraries. 

A total sample size of about 40 is required. If sampling costs in each group 
are the same, how should sample sizes in a stratified simple random sample 
be allocated to the three groups? 


5.2 A stratified population has 5 strata. The stratum sizes, $N_i$, and the means
and variances, $\bar{Y}_i$ and $S_i^2$, of some variable Y are as follows.


Stratum    $N_i$          $\bar{Y}_i$    $S_i^2$

1          [illegible]    7.3            1.31
2          98             6.9            2.03
3          74             11.2           1.43
4          41             9.1            1.96
5          45             9.6            1.74


Calculate the overall population mean and variance, $\bar{Y}$ and $S^2$. For a
stratified simple random sample of size about 80, determine the appropriate
stratum sample sizes under proportional allocation, and Neyman allocation.
Work out (for the same total size of sample) the efficiency of the s.r. sample
mean $\bar{y}$ as an estimator of $\bar{Y}$, relative to the stratified sample means for the
two methods of allocation.


5.3 A stratified population of total size N is made up of k strata of sizes
$N_1, N_2, \ldots, N_k$. The ith stratum contains a proportion $P_i$ of members possessing
a particular characteristic (i = 1, 2, ..., k). If the f.p.c. can be ignored,
show that the variance of the stratified simple random sample estimator of the
overall proportion, P, of population members possessing the particular
characteristic is approximately

$$\frac{1}{n}\left[\sum_{i=1}^{k}\frac{N_i}{N}\sqrt{P_i(1-P_i)}\right]^2$$

when the stratum sample sizes are optimally allocated.

Suppose that a population of size 10 000 has 3 strata with weights 0.3, 0.6
and 0.1, and the stratum population proportions are 0.4, 0.6 and 0.3 respectively.
A sample of size 100 is to be chosen. Determine the stratum sample
sizes for equal allocation, proportional allocation, and optimum allocation.

Compare the efficiencies of the stratified simple random sample estimator
of the population proportion for the three forms of allocation (the finite
population correction can be ignored).

5.4 In a stratified population the sampling cost for obtaining a stratified
simple random sample of size n, made up of $n_i$ observations from the ith
stratum (i = 1, 2, ..., k), is

$$C = c_0 + \sum_{i=1}^{k} c_i n_i .$$

The stratified sample mean, $\bar{y}_{st}$, is to be used to estimate the population
mean $\bar{Y}$.
Determine the optimum allocation of the $n_i$ (to minimise Var($\bar{y}_{st}$) for fixed
total cost C).


5.5 All the farms in a county are stratified by farm size, and the mean
number of hectares of wheat per farm is determined for each stratum, with the
following results.


Farm size     No. of    Mean wheat     Standard
(hectares)    farms     (hectares)     deviation

0-20          368       [illegible]    2.1
21-40         425       8.1            3.6
41-60         389       12.1           3.9
61-80         316       16.9           5.1
81-100        174       20.8           6.1
101-120       98        25.2           6.5
121-          138       31.8           9.1


For a sample of 100 farms, compute the sample sizes in each stratum under 
stratified simple random sampling with: 

(a) proportional allocation; 

(b) Neyman allocation. 

Compare the precision of these methods (for estimating the mean number 
of hectares of wheat) with that of s.r. sampling. 


6 Cluster and multi-stage sampling


A major problem in survey sampling is matching the study population and 
target population. Sometimes there is no list to constitute our sampling frame 
and from which we can choose at random individual members of the population. 
Instead, the sampling frame often consists of a division of the target (or study) 
population into non-overlapping groups of population members. All population 
members are represented once only in these groups, which may be of different 
sizes. There may be a convenient list of the groups in the population, which 
can be used for specifying the sample that will be sought. 

The sampling frame thus provides a coverage of the population of interest, 
but its members (the sampling units) do not correspond to individual members 
of the population. Such loss of identification of individuals is offset by the 
great convenience of having a tangible list of sampling units in which to define 
a sample and, frequently, by practical advantages of cost or access in contacting 
chosen sample members (which are of course sampling units, or groups of 
members of the population). For example, a list of addresses might be a 
convenient basis of access to individuals in households—but each address 
may correspond to several people. Or, in an enquiry into the performance of 
schoolchildren, choice of a sample of schools is likely to be easier and less 
expensive than choice of schoolchildren individually (irrespective of whether 
or not a complete list of schoolchildren exists). 

Note how such dual considerations of the convenience of a list for specifying 
a required sample, and cost and access advantages, imply that we essentially 
sample from a stratified population. The strata are in fact the sampling units, 
which are likely to be many in number, each containing relatively few popula- 
tion members. But methods of stratified sampling are unlikely to be appropriate. 
As described in Chapter 5 these involve sampling from each stratum. The 
administrative advantage in using a sampling frame with units representing 
many small strata lies in being able to restrict the selection of such strata. Thus 
instead of selecting some population members within each stratum, we will 
wish to select some strata but possibly study each selected stratum in full. 

This difference of emphasis is reflected in different terminology. The strata 
are now called clusters; the choice of a sample of such clusters to yield a 
sample of the population members is called cluster sampling. 

If all the population members in each selected cluster are used in the sample, 
the method is known as one-stage cluster sampling. If not, but further selection 
is exercised within the chosen clusters, the technique is called sub-sampling,
or two-stage cluster sampling. Sometimes the members of the clusters themselves 
consist of groups of population members. The clusters are then called primary 
units, their constituent sub-groups secondary units, and so on. Selection of a 
sample by choosing from among the primary units, from secondary units within 
chosen primary units, and so on to further stages, is called multi-stage cluster 
sampling. For instance, in examining performance of schoolchildren in the 
educational survey referred to above, local education authorities may constitute 
the primary units, their schools the secondary units, classes within the schools
the tertiary units, and children within the classes the members of the study
population.

In the discussion of stratified sampling in Chapter 5, it was recognised that 
the manner in which a population is stratified may be conditioned by 
administrative factors. Nonetheless, the major interest in stratification is in its 
potential value for producing more efficient estimators of population characteris- 
tics. In contrast, cluster sampling is employed almost exclusively for administra- 
tive convenience; either to ease sample specification through the existence of 
a list of the clusters, or to improve access to the population, or to reduce 
sampling costs. Cluster sampling methods are, and need to be, widely 
employed. Often other methods, were they feasible, would produce more 
efficient estimators but at much greater cost and administrative effort. 

The hope is that any loss in potential efficiency is outweighed by reduction 
in sampling costs, and greater sampling facility. Any objective comparison of 
cluster sampling with other methods needs to be made on this basis: it is 
unrealistic to exclude cost considerations in such a comparison, although the
accurate specification of costs is not always easy, and it would be optimistic 
to claim that cluster sampling in practice is often supported by a justificatory 
cost analysis. Pragmatism is the major stimulus! 


6.1 One-stage cluster sampling with equal sized clusters


As in our study of other sampling methods, we shall again restrict attention 
to just one or two aspects of cluster sampling. Detailed discussion will be 
confined, in the main, to one-stage and two-stage cluster sampling. Multi-stage 
schemes are, of course, often used and details are readily accessible elsewhere 
(e.g. in Som, 1973). We shall not further study two-phase (or multi-phase) 
sampling (see Section 5.5) for prior estimation of cluster variances or correla- 
tions. Neither is there space to take up the issue of how to choose the basis 
of clustering, when several possibilities exist. 

We start with one-stage cluster sampling. Suppose that the population
consists of a set of clusters of individual population members. There are M
clusters of sizes $N_1, N_2, \ldots, N_M$ ($\sum_{i=1}^{M} N_i = N$). The members of the clusters
are $Y_{ij}$ ($i = 1, 2, \ldots, M$; $j = 1, 2, \ldots, N_i$), and the cluster means and cluster
variances are $\bar{Y}_i$ and $S_i^2$ ($i = 1, 2, \ldots, M$), defined in the usual way. The
population mean and variance are $\bar{Y}$ and $S^2$, respectively.

A sample of population members is obtained by taking a simple random
sample of m clusters and including in the sample all members of the chosen
clusters. The resulting sample, of size n > m, is a one-stage cluster sample. How
are we to use it to estimate $\bar{Y}$?

The simplest case to study is where all clusters are of the same size.

Suppose

$$N_1 = N_2 = \cdots = N_M = L.$$

Then

$$N = ML.$$

The one-stage cluster sample, arising as an s.r. sample of m clusters, has size

$$n = mL;$$

the sampling fraction is

$$f = n/N = m/M.$$

Suppose the observations in the sample are $y_{ij}$ ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, L$).
As an estimator of $\bar{Y}$, the cluster sample mean might be considered. This
is just

$$\bar{y}_{cl} = \frac{1}{mL}\sum_{i=1}^{m}\sum_{j=1}^{L} y_{ij}. \tag{6.1}$$
Since all clusters are of the same size, $\bar{y}_{cl}$ is clearly unbiased for $\bar{Y}$; that is

$$E(\bar{y}_{cl}) = \bar{Y}.$$

Furthermore its variance is easily shown to be

$$\mathrm{Var}(\bar{y}_{cl}) = \frac{(1-f)}{m}\,\frac{\sum_{i=1}^{M}(\bar{Y}_i - \bar{Y})^2}{M-1}. \tag{6.2}$$

These results arise from the fact that $\bar{y}_{cl}$ can be expressed as

$$\bar{y}_{cl} = \frac{1}{m}\sum_{i=1}^{m}\bar{y}_i,$$

where the $\bar{y}_i$ are the cluster means for an s.r. sample of m of the M clusters.
Regarding the set of cluster means $\{\bar{Y}_1, \bar{Y}_2, \ldots, \bar{Y}_M\}$ as the basic population
from which we are sampling, this population has mean

$$\bar{Y} = \frac{1}{M}\sum_{i=1}^{M}\bar{Y}_i$$

and variance

$$\frac{1}{M-1}\sum_{i=1}^{M}(\bar{Y}_i - \bar{Y})^2.$$

The unbiasedness of $\bar{y}_{cl}$ as an estimator of $\bar{Y}$, and the form (6.2) for its variance,
now follow directly from the earlier results (see (2.3)) for an s.r. sample mean.

Consider the alternative estimator of $\bar{Y}$ provided by the mean, $\bar{y}$, of an s.r.
sample of n (= mL) observations drawn without restriction from the total
population (ignoring the cluster structure). This is also unbiased, and has
variance $(1-f)S^2/(mL)$. How does this compare in efficiency with $\bar{y}_{cl}$?
Note that

$$(ML-1)S^2 = \sum_{i=1}^{M}\sum_{j=1}^{L}(Y_{ij} - \bar{Y})^2
= M(L-1)\bar{S}^2 + L\sum_{i=1}^{M}(\bar{Y}_i - \bar{Y})^2, \tag{6.3}$$
where

$$\bar{S}^2 = \frac{1}{M}\sum_{i=1}^{M} S_i^2$$

is the average within-cluster variance.
Thus

$$\mathrm{Var}(\bar{y}) - \mathrm{Var}(\bar{y}_{cl})
= \frac{(1-f)}{mL(M-1)}\left[(M-1)S^2 - L\sum_{i=1}^{M}(\bar{Y}_i - \bar{Y})^2\right]
= \frac{(1-f)M(L-1)}{mL(M-1)}\,(\bar{S}^2 - S^2), \tag{6.4}$$

by (6.3).


So the question of the relative efficiency of $\bar{y}$ and $\bar{y}_{cl}$ (in terms of a straight
comparison of their variances, ignoring cost or convenience factors) has a
simple resolution. The cluster sample mean, $\bar{y}_{cl}$, will be better than the s.r. sample
mean, $\bar{y}$, if the average within-cluster variance, $\bar{S}^2$, is larger than the overall
population variance, $S^2$, and vice versa.
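
The identity (6.3) and the comparison (6.4) are easy to check numerically. The sketch below uses a small hypothetical population (M = 4 clusters of size L = 3, values invented for illustration) in which each cluster spans most of the range of Y-values, so that the average within-cluster variance exceeds $S^2$. It also enumerates every possible sample of m = 2 clusters to confirm (6.2) exactly. The variable names are our own.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

# Hypothetical population: M = 4 clusters, each of size L = 3.
clusters = [[2, 9, 16], [4, 8, 15], [1, 10, 14], [3, 7, 17]]
M, L = len(clusters), len(clusters[0])
m = 2                      # number of clusters sampled
f = m / M                  # sampling fraction

values = [y for c in clusters for y in c]
Y_bar = mean(values)
S2 = variance(values)                               # divisor ML - 1
S2_within = mean([variance(c) for c in clusters])   # average within-cluster variance
between = sum((mean(c) - Y_bar) ** 2 for c in clusters)

var_cl = (1 - f) / m * between / (M - 1)            # formula (6.2)
var_sr = (1 - f) * S2 / (m * L)                     # s.r. sample of size mL

# Both sides of identity (6.3) and the right-hand side of (6.4).
lhs_63 = (M * L - 1) * S2
rhs_63 = M * (L - 1) * S2_within + L * between
rhs_64 = (1 - f) * M * (L - 1) / (m * L * (M - 1)) * (S2_within - S2)

# Exact check of (6.2): enumerate all equally likely samples of m clusters.
estimates = [mean([mean(c) for c in s]) for s in combinations(clusters, m)]
var_enumerated = pvariance(estimates)

print(var_cl, var_sr, var_sr - var_cl, rhs_64)
```

For this population the cluster sample mean has much the smaller variance, as (6.4) predicts when the within-cluster variation is large; rearranging the same twelve values into internally homogeneous clusters would reverse the conclusion.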

It is interesting to observe that this is essentially the reverse of what was 
found in stratified sampling, where the method yielded greater efficiency if 
the within-strata variation was sufficiently low. Bearing in mind the basic
difference in the two sampling techniques (cluster sampling and stratified 
sampling), this reversal is what would be expected intuitively! 

There is another way of looking at these results. We can define a quantity
called the intra-cluster correlation coefficient,

$$\rho = \frac{2\sum_{i=1}^{M}\sum_{j<k}(Y_{ij} - \bar{Y})(Y_{ik} - \bar{Y})}{(L-1)(ML-1)S^2}, \tag{6.5}$$

which provides an aggregate measure of the correlation between population
members in the same cluster. This clearly has affinities with $\bar{S}^2$; the larger the
value of $\rho$, the smaller, in general, we would expect $\bar{S}^2$ to be. In fact,

$$\bar{S}^2 = \frac{(ML-1)}{ML}(1-\rho)S^2 \tag{6.6}$$

(see Exercise 6.2 at the end of this chapter), and we can express Var($\bar{y}_{cl}$) as

$$\mathrm{Var}(\bar{y}_{cl}) = \frac{(1-f)}{mL}\,\frac{ML-1}{L(M-1)}\,S^2\,[1 + (L-1)\rho]. \tag{6.7}$$

The condition, $\bar{S}^2 > S^2$, for $\bar{y}_{cl}$ to be more efficient than $\bar{y}$, now becomes

$$\rho < -\frac{1}{ML-1},$$

so that, as long as the population is large, the requirement for greater efficiency
of $\bar{y}_{cl}$ is that the intra-cluster correlation should be negative. (Again, see Exercise
6.2 at the end of this chapter.)
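
As a numerical illustration of (6.5) to (6.7), the following sketch computes the intra-cluster correlation for a hypothetical population of M = 4 clusters of size L = 3 (values invented for illustration) and confirms that (6.6) reproduces the average within-cluster variance and that (6.7) agrees with (6.2). It is a check of the algebra under these assumptions, not a general-purpose routine.

```python
from itertools import combinations
from statistics import mean, variance

# Hypothetical population: M = 4 clusters, each of size L = 3.
clusters = [[2, 9, 16], [4, 8, 15], [1, 10, 14], [3, 7, 17]]
M, L = len(clusters), len(clusters[0])
m = 2
f = m / M

values = [y for c in clusters for y in c]
Y_bar = mean(values)
S2 = variance(values)

# Sum of cross-products over all pairs of members within the same cluster.
cross = sum((a - Y_bar) * (b - Y_bar)
            for c in clusters for a, b in combinations(c, 2))
rho = 2 * cross / ((L - 1) * (M * L - 1) * S2)            # (6.5)

S2_within = mean([variance(c) for c in clusters])
S2_within_66 = (M * L - 1) / (M * L) * (1 - rho) * S2     # (6.6)

between = sum((mean(c) - Y_bar) ** 2 for c in clusters)
var_62 = (1 - f) / m * between / (M - 1)                  # (6.2)
var_67 = ((1 - f) / (m * L) * (M * L - 1) / (L * (M - 1))
          * S2 * (1 + (L - 1) * rho))                     # (6.7)

print(rho, var_62, var_67)
```

Here $\rho$ is strongly negative, below the threshold $-1/(ML-1)$, which is the situation in which the cluster sample mean is the more efficient estimator.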
It should be recognised that, since the prime stimulus for using a cluster
sample is one of convenience, the greater efficiency of $\bar{y}_{cl}$ over $\bar{y}$, although
feasible, is not likely to be widely encountered nor is it greatly important.

The population total $Y_T$ can, of course, be estimated by $ML\bar{y}_{cl}$, and its
variance is merely the appropriate multiple of (6.2).

In practice we will need to estimate Var($\bar{y}_{cl}$), and an unbiased estimator is
obtained by replacing $\sum_{i=1}^{M}(\bar{Y}_i - \bar{Y})^2/(M-1)$ in (6.2) by
$\sum_{i=1}^{m}(\bar{y}_i - \bar{y}_{cl})^2/(m-1)$. Under appropriate conditions we can again construct approximate
confidence intervals for $\bar{Y}$, or $Y_T$, by assuming that $\bar{y}_{cl}$ is normally distributed.
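
A minimal sketch of this estimation step is given below. The sample data, the number of clusters M = 30 and the function name are hypothetical, and the multiplier 1.96 assumes that the normal approximation is adequate, which is doubtful when as few clusters are sampled as in this toy illustration.

```python
from math import sqrt
from statistics import mean, variance

def cluster_sample_summary(sample_clusters, M, z=1.96):
    """Estimate the population mean from a one-stage cluster sample of
    equal-sized clusters; returns (estimate, estimated variance, interval)."""
    m = len(sample_clusters)
    f = m / M
    cluster_means = [mean(c) for c in sample_clusters]
    y_cl = mean(cluster_means)                    # (6.1)
    s2_between = variance(cluster_means)          # divisor m - 1
    var_hat = (1 - f) * s2_between / m            # sample analogue of (6.2)
    half_width = z * sqrt(var_hat)
    return y_cl, var_hat, (y_cl - half_width, y_cl + half_width)

# Hypothetical sample: m = 3 clusters of size L = 4 drawn from M = 30 clusters.
sample = [[12, 15, 11, 14], [20, 18, 22, 20], [16, 17, 15, 16]]
est, var_hat, ci = cluster_sample_summary(sample, M=30)
print(est, var_hat, ci)
```

Multiplying the estimate by ML, and the estimated variance by $(ML)^2$, gives the corresponding results for the population total.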


Example 6.1 


Consider, once again, the data on heights of students given in the 
Statistics Class. We could consider either (i) the rows, or (ii) the 
columns, as 5 equal-sized clusters in the population. Picking one 
row, or one column, at random yields a cluster sample of size 5. 
To estimate the mean height Y the cluster sample mean might be 
considered in either case. Note that only 5 possible values can arise 
(in each case) and they do so with equal probabilities. The two 
sampling distributions are thus particularly simple. They do not 
need to be estimated by taking a large number of samples, as was 
done previously to illustrate results for s.r. sampling, ratio estimation,
or stratified sampling. The possible values for $\bar{y}_{cl}$ (and the
within-cluster variances) are (see Example 5.1) in case (i)


11.0 210° 236 32.4 29.4 
2.0 Teams 393 493 


and in case (ii) 


25.2 . 22 wee 23.8 26:8 
68.2   95.2   39.7   161.2   92.2


The average within-cluster variances are 26.5 and 91.3 respectively.
The exact sampling distributions are presented in Figure 6.1. The
exact variances of $\bar{y}_{cl}$ are easily obtained by working out the
variances of the possible $\bar{y}_{cl}$ values in cases (i) and (ii) (that is by using
the result (6.2)). We obtain, in case

(i) Var($\bar{y}_{cl}$) = 56.0,

(ii) Var($\bar{y}_{cl}$) = 4.2.

The variance of an s.r. sample mean, $\bar{y}$, based on an s.r. sample of
size 5 from the whole population, was found to be 12.87. Thus we
confirm the results above for cluster sampling. When the average
within-cluster variance is smaller than $S^2$ (= 80.4), $\bar{y}$ is better than
$\bar{y}_{cl}$. This is markedly so in case (i). When it is larger than $S^2$, $\bar{y}_{cl}$ is
better than $\bar{y}$. This happens in case (ii).


[Fig. 6.1. Sampling distributions of $\bar{y}_{cl}$ for (i) rows, (ii) columns for samples of size 5 in the
Statistics Class. Each panel plots probability against $\bar{y}_{cl}$ over the range 10 to 35.]


We note the expected contrast with stratified sampling, where stratification
by rows produced a large improvement in efficiency over $\bar{y}$; by columns, not
so!


The same principles apply for estimating the population proportion, P, of
Y-values satisfying some criterion. Suppose $P_i$ ($i = 1, 2, \ldots, M$) are the
corresponding cluster proportions. Our sample yields m of these: $p_1, p_2, \ldots, p_m$.
The cluster sampling estimate of P is now

$$p_{cl} = \frac{1}{m}\sum_{i=1}^{m} p_i,$$

which is just a special case of $\bar{y}_{cl}$, where Y is assigned new values 1 or 0
depending on whether or not it satisfies the criterion of interest. The variance
of $p_{cl}$ is obtained from (6.2) by replacing the $\bar{Y}_i$ and $\bar{Y}$ by the $P_i$ and P
respectively.

Before proceeding to discuss what happens when the clusters have different 
sizes, we shall consider a special form of cluster sampling which is very widely 
used in practice. 


6.2 Systematic sampling revisited 


We noted in Section 2.7 the initial attraction of systematic sampling: taking 
observations at equally-spaced intervals from a listing of the finite population. 
We also remarked on the possible dangers that might arise if there were 
systematic trends or groupings in the list. One difficulty was the fact that this 
approach does not necessarily correspond with s.r. sampling. 

However, the sampling mechanism is in fact well-defined. We are essentially 
taking a cluster sample of size m=1! We can see this as follows. 
Consider a systematic sample chosen by taking one member at random from
the first M on the list, and every Mth subsequent one. Suppose this yields a
sample of size n. The population can be thought of as made up of M clusters:

$$\begin{array}{llll}
X_1, & X_{M+1}, & X_{2M+1}, & \ldots \\
X_2, & X_{M+2}, & X_{2M+2}, & \ldots \\
\vdots & & & \\
X_M, & X_{2M}, & X_{3M}, & \ldots
\end{array}$$

of sizes differing by at most one. The systematic sample is just one of the rows
drawn at random and is thus a cluster sample, consisting of one cluster chosen
at random from the M clusters. Suppose N = LM, so that all clusters have
the same size L = (N/M) = n; we can immediately apply the results of Section
6.1, with m = 1. If N/n is not an integer, so that the cluster sizes are not all
exactly the same, minor modifications will be necessary in the terms of the
next section, but the effects are qualitatively unaltered.
To estimate the population mean $\bar{Y}$ we take the systematic sample mean

$$\bar{y}_s = \frac{1}{n}\sum_{j=1}^{n} y_j,$$

where $y_1, y_2, \ldots, y_n$ are the observations in the single chosen cluster of the
population. In the notation of Section 6.1, $\bar{y}_s$ is just $\bar{y}_{cl}$ based on m (= 1)
clusters chosen at random from the M = N/L 'systematic' clusters of size
L = n into which the population has been divided. So m = 1, L = n, M = N/n,
f = n/N. We conclude that

$$E(\bar{y}_s) = \bar{Y},$$

so that $\bar{y}_s$ is unbiased, and by (6.2),

$$\mathrm{Var}(\bar{y}_s) = \frac{1}{M}\sum_{i=1}^{M}(\bar{Y}_i - \bar{Y})^2, \tag{6.8}$$

where $\bar{Y}_i$ is the mean of the ith cluster (or the ith of the M potential systematic
samples). Equation (6.3) becomes

$$(N-1)S^2 = M(n-1)\bar{S}^2 + Mn\,\mathrm{Var}(\bar{y}_s),$$

so that if $\bar{y}$ is the mean of an s.r. sample of size n

$$\mathrm{Var}(\bar{y}) - \mathrm{Var}(\bar{y}_s) = \frac{n-1}{n}\,(\bar{S}^2 - S^2).$$
The systematic sample mean will thus be more efficient than $\bar{y}$ if $\bar{S}^2 > S^2$:
that is, in the event that the average variance within systematic samples is
larger than the overall population variance. This can sometimes happen, and 
endorses the intuitive feeling that systematic sampling (as well as being easy 
to carry out) can be statistically advantageous if systematic division of the 
population results in widely differing Y-values in each potential systematic 
sample. In practical terms this effect depends on the way in which the popula- 
tion has been listed (the listing may be highly structured in terms of Y values, 
or at the other extreme essentially random). Cochran (1977, Chapter 8) con- 
siders this matter in some detail. 

The results can again be expressed in terms of the intra-cluster correlation
$\rho$. We have, from (6.7), that $\bar{y}_s$ has

$$\mathrm{Var}(\bar{y}_s) = \left(1 - \frac{1}{N}\right)[1 + (n-1)\rho]\,S^2/n,$$

and efficiency in excess of $\bar{y}$ if

$$\rho < -1/(N-1) \approx 0,$$

that is (essentially) if $\rho$ is negative.

If N/n is not an integer, some potential systematic samples will have one 
more member than others: in this respect the clusters in the population are 
not now all of the same size. The effect of such (small) differences in cluster 
sizes will be qualitatively unimportant unless the systematic sample size is very 
small. 
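
The view of systematic sampling as a one-cluster sample can be made concrete by listing all M potential systematic samples. The sketch below does this for a hypothetical ordered list of N = 12 values with a steady trend (invented for illustration), with M = 3 and n = 4, and checks the systematic-sampling form of (6.3) and the resulting variance comparison. The function name is our own.

```python
from statistics import mean, pvariance, variance

def systematic_samples(population, M):
    """The M possible systematic samples: start at position r, then every Mth member."""
    return [population[r::M] for r in range(M)]

# Hypothetical list showing a steady trend: N = 12, M = 3, so n = 4.
population = [3, 4, 6, 7, 9, 10, 12, 14, 15, 17, 18, 20]
M = 3
samples = systematic_samples(population, M)
N = len(population)
n = len(samples[0])

S2 = variance(population)
var_sys = pvariance([mean(s) for s in samples])     # (6.8): each sample has probability 1/M
var_sr = (1 - n / N) * S2 / n                       # s.r. sample of the same size
S2_within = mean([variance(s) for s in samples])    # average variance within systematic samples

print(samples)
print(var_sys, var_sr, S2_within, S2)
```

With a trend in the list each systematic sample spans the whole range of values, so the within-sample variance is large and the systematic sample mean is the more efficient; a list with a periodic pattern of period M would produce the opposite effect.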


6.3 One-stage cluster sampling with different sized clusters


Suppose as before that the population consists of M clusters, but that their
sizes are $N_1, N_2, \ldots, N_M$ ($\sum_{i=1}^{M} N_i = N$) where not all the $N_i$ have the same
value. A cluster sample is again drawn as the basis for estimating some
population characteristic, say $\bar{Y}$. The cluster sample consists of all members
of each of m clusters randomly selected from the M clusters in the population.
Suppose that the sizes, means, and totals of the chosen clusters are $n_i$, $\bar{y}_i$, $y_{iT}$
($i = 1, 2, \ldots, m$).

Certain complications now arise because the cluster sizes differ; various
alternative estimators of $\bar{Y}$ might be considered, and their sampling behaviour
is not always easy to determine precisely. We shall consider here just three
possible methods of estimating $\bar{Y}$ (with obvious extensions to the estimation
of $Y_T$ or P); one further possibility is discussed later (Section 6.4).

It is useful to distinguish between the primary units (the clusters) and the
secondary units (the population members within the clusters). The cluster
sample is an s.r. sample of the primary units, and we can carry over the results
for s.r. sampling given in earlier chapters to study its behaviour. We are
essentially sampling a population of size M, where each member is represented
by certain variables: for example, the cluster total $Y_{iT}$, the cluster mean $\bar{Y}_i$,
or the cluster size $N_i$ ($i = 1, 2, \ldots, M$). The cluster sample provides
observed values $y_{iT}$, $\bar{y}_i$, $n_i$ ($i = 1, 2, \ldots, m$) of these variables.

The overall population mean $\bar{Y}$ (that is the mean value of Y over all
secondary units) is

$$\bar{Y} = \sum_{i=1}^{M}\sum_{j=1}^{N_i} Y_{ij} \Big/ \sum_{i=1}^{M} N_i
= \sum_{i=1}^{M} Y_{iT} \Big/ \sum_{i=1}^{M} N_i, \tag{6.9}$$

which is interpretable in the 'reduced' population of primary units as a
population ratio of the total of the $Y_{iT}$ values to the total of the $N_i$ values. If
we know both the number of clusters M, and the total overall population size
$N = \sum_{i=1}^{M} N_i$, then writing

$$\bar{Y} = \left(\frac{M}{N}\right)\sum_{i=1}^{M} Y_{iT} \Big/ M, \tag{6.10}$$

$\bar{Y}$ is alternatively represented as just a known multiple of the primary
population mean value of $Y_{iT}$ (i.e. of the mean cluster total $\bar{Y}_T = (1/M)\sum_{i=1}^{M} Y_{iT}$).

These representations suggest two possible ways of estimating $\bar{Y}$ from the
cluster sample.


(a) The cluster sample ratio
Use of the results of Section 4.1 suggests estimating $\bar{Y}$ by

$$\bar{y}_{c(a)} = \sum_{i=1}^{m} y_{iT} \Big/ \sum_{i=1}^{m} n_i, \tag{6.11}$$

which is the ratio of the sum of the cluster totals to the sum of the cluster
sizes, in the chosen sample of clusters.
Thus (see Section 4.1) $\bar{y}_{c(a)}$ will have bias of order $m^{-1}$ which will be
unimportant only if the number of clusters in the sample is large. The variance
of $\bar{y}_{c(a)}$ will be given by the approximation (see (4.5))

$$\mathrm{Var}(\bar{y}_{c(a)}) \approx \frac{(M-m)M}{m(M-1)}\sum_{i=1}^{M}\left(\frac{N_i}{N}\right)^2(\bar{Y}_i - \bar{Y})^2. \tag{6.12}$$

This variance depends on the variation between the cluster means: the smaller
the variation, the smaller the variance. The effect is similar to what was found
in the case of equal sized clusters; compare (6.12) and (6.2). Var($\bar{y}_{c(a)}$) can
be estimated from the sample by

$$\frac{(M-m)M}{m(m-1)}\sum_{i=1}^{m}\left(\frac{n_i}{N}\right)^2(\bar{y}_i - \bar{y}_{c(a)})^2,$$

and if N is unknown, it may be replaced by the sample estimate Mn/m (where
$n = \sum_{i=1}^{m} n_i$) to yield

$$\frac{(M-m)m}{M(m-1)}\sum_{i=1}^{m}\left(\frac{n_i}{n}\right)^2(\bar{y}_i - \bar{y}_{c(a)})^2.$$

To estimate the overall total $Y_T$ we merely use $N\bar{y}_{c(a)}$; but here knowledge
of the value of N is essential. The variance is just $N^2\,\mathrm{Var}(\bar{y}_{c(a)})$.


(b) The cluster sample total
If the total population size is known, (6.10) suggests that we estimate $\bar{Y}$ from
the usual s.r. sample estimate of a population mean. This implies using the
estimator

$$\bar{y}_{c(b)} = \frac{M}{Nm}\sum_{i=1}^{m} y_{iT}. \tag{6.13}$$

Clearly,

$$E(\bar{y}_{c(b)}) = \frac{M}{N}\bar{Y}_T = \bar{Y},$$

so that $\bar{y}_{c(b)}$ has the advantage of being strictly unbiased.

Its variance is, from (2.3),

$$\mathrm{Var}(\bar{y}_{c(b)}) = \frac{(M-m)M}{m(M-1)N^2}\sum_{i=1}^{M}(Y_{iT} - \bar{Y}_T)^2. \tag{6.14}$$

Since $\bar{Y}_T = (N/M)\bar{Y}$, and $N_i(\bar{Y}_i - \bar{Y}) = Y_{iT} - N_i\bar{Y}$, (6.14) differs from (6.12) merely
in the fact that the sum of squares of the cluster totals is calculated about the fixed
quantity $N\bar{Y}/M$ rather than about the individual $N_i\bar{Y}$, which depend on the specific
cluster sizes.

This implies that (6.14) will tend to be larger than (6.12), since cluster totals
are most likely to be positively correlated with cluster sizes. This possible
disadvantage needs to be set against the attraction of the unbiasedness of
$\bar{y}_{c(b)}$. But the greater efficiency of $\bar{y}_{c(a)}$ is far from guaranteed: many factors
enter into the comparison, including the relationship (if any) between the $\bar{Y}_i$
and $N_i$, and the variability of the cluster sizes. If the $N_i$ do not vary greatly,
then Var($\bar{y}_{c(a)}$) need not be much different to Var($\bar{y}_{c(b)}$) (indeed for equal
sized clusters they are identical).

The estimation of Var($\bar{y}_{c(b)}$) from the sample data, and corresponding results
for estimating the population total, follow in the obvious way. Again there is
the advantage that N need not be known for estimating $Y_T$.

Yet another estimator of $\bar{Y}$, which has as its principal attraction a very
simple form which is easily calculated, is obtained as:


(c) The unweighted average of the chosen cluster means
That is, we use

$$\bar{y}_{c(c)} = \frac{1}{m}\sum_{i=1}^{m}\bar{y}_i. \tag{6.15}$$

This estimator is biased and inconsistent (in the finite population sense) unless
all cluster sizes are the same. It provides a useful 'quick estimate' which will
not be too seriously biased unless the cluster means and cluster sizes are highly
correlated. Its variance is again obtained from (2.3), as

$$\mathrm{Var}(\bar{y}_{c(c)}) = \frac{M-m}{Mm(M-1)}\sum_{i=1}^{M}(\bar{Y}_i - \bar{Y}_c)^2, \tag{6.16}$$

(where $\bar{Y}_c = (1/M)\sum_{i=1}^{M}\bar{Y}_i$), which can be estimated from the sample by

$$\frac{M-m}{Mm(m-1)}\sum_{i=1}^{m}(\bar{y}_i - \bar{y}_{c(c)})^2.$$

(If the bias is substantial, however, then the expected mean square error can
of course be much larger than (6.16).)
The expected value of $\bar{y}_{c(c)}$ is $\bar{Y}_c$, so that its bias is

$$\bar{Y}_c - \bar{Y} = \frac{1}{M}\sum_{i=1}^{M}\bar{Y}_i - \frac{1}{N}\sum_{i=1}^{M} N_i\bar{Y}_i.$$

If the $\bar{Y}_i$, or the $N_i$, do not vary too much, this bias will not be serious, and
the estimator can compare reasonably in efficiency with $\bar{y}_{c(a)}$ or $\bar{y}_{c(b)}$.
The corresponding estimator of $Y_T$ is $N\bar{y}_{c(c)}$, with variance $N^2\,\mathrm{Var}(\bar{y}_{c(c)})$.
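
The behaviour of the three estimators can be compared exactly on a small population by enumerating every possible s.r. sample of clusters. The sketch below does this for a hypothetical population of M = 5 clusters of unequal size (values invented for illustration) with m = 2; the function names are our own, and the output illustrates only this one population, not a general ranking of the estimators.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of M = 5 clusters of unequal size.
clusters = [[4, 6], [5, 7, 9], [8, 10, 12, 14], [3, 5], [9, 11, 13]]
M = len(clusters)
N = sum(len(c) for c in clusters)
Y_bar = sum(sum(c) for c in clusters) / N
m = 2

def est_a(s):
    """(6.11): ratio of the sum of cluster totals to the sum of cluster sizes."""
    return sum(sum(c) for c in s) / sum(len(c) for c in s)

def est_b(s):
    """(6.13): known multiple M/N of the mean sampled cluster total."""
    return M / (N * len(s)) * sum(sum(c) for c in s)

def est_c(s):
    """(6.15): unweighted average of the sampled cluster means."""
    return mean([mean(c) for c in s])

def bias_and_mse(estimator):
    """Exact bias and mean square error over all equally likely samples."""
    values = [estimator(s) for s in combinations(clusters, m)]
    bias = mean(values) - Y_bar
    mse = mean([(v - Y_bar) ** 2 for v in values])
    return bias, mse

results = {name: bias_and_mse(fn)
           for name, fn in [("a", est_a), ("b", est_b), ("c", est_c)]}
for name, (bias, mse) in results.items():
    print(name, round(bias, 4), round(mse, 4))
```

Estimator (b) shows zero bias, as it must, and its mean square error coincides with (6.14); the biases of (a) and (c) depend on how strongly the cluster means are related to the cluster sizes in the population chosen.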


6.4 Cluster sampling with probability proportional to size


In Section 2.8 and elsewhere we have considered the implications of replacing 
s.r. sampling from the overall population with a sampling scheme in which 
the probability of choosing a sample member was in some way related to its 
corresponding Y-value (where Y is the variable of interest). Another situation 
in which samples may usefully be drawn with different probabilities attached 
to the occurrence of different population members is in the sphere of cluster 
sampling. This will be illustrated for one-stage sampling although the technique 
has (perhaps greater) advantages when applied to the choice of the primary 
units in multi-stage sampling. Suppose it is necessary to estimate the population
total, $Y_T$, for a population which consists of M clusters of sizes $N_1, N_2, \ldots, N_M$
($\sum_{i=1}^{M} N_i = N$) with cluster totals $Y_{iT}$ ($i = 1, 2, \ldots, M$). For this purpose a
single stage cluster sample, of m < M clusters, is to be chosen. For simple
random sampling the results have been presented in the previous Section. In
many situations we will find that the larger clusters contribute the larger values
$Y_{iT}$ towards the total $Y_T$. It might seem sensible, therefore, to give greater
attention to such clusters, perhaps by giving them larger probabilities of
occurrence in the sample.

To keep the analysis fairly simple, we shall (as in Section 2.8) consider
sampling with replacement, in such a way that the ith population cluster (of
size $N_i$; $i = 1, 2, \ldots, M$) has probability $p_i$ of being chosen. From the results
of Section 2.8 we immediately conclude that the ideal value of $p_i$ is $Y_{iT}/Y_T$.
The corresponding estimator of $Y_T$ is then

$$\tilde{y}_T = \frac{1}{m}\sum_{i=1}^{m}\frac{y_{iT}}{p_i} = Y_T,$$

with zero sampling variance. But again this is not feasible, since if we knew $Y_T$
we would not need to estimate it. However, an associated alternative measure
of cluster size which may be known is the number of population members, $N_i$,
in the cluster. This suggests sampling the clusters with probabilities $p_i = N_i/N$,
and the estimator becomes

$$\tilde{y}_T = \frac{N}{m}\sum_{i=1}^{m}\bar{y}_i, \tag{6.17}$$

which is just N times the average value of the means of the chosen clusters.


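Using an argument similar to that employed in Section 2.8, we see that this
estimator is unbiased, and has variance

$$\mathrm{Var}(\tilde{y}_T) = \frac{N}{m}\sum_{i=1}^{M} N_i(\bar{Y}_i - \bar{Y})^2. \tag{6.18}$$

Corresponding to (6.17) we have an estimator of $\bar{Y}$ of the form

$$\tilde{y} = \tilde{y}_T/N = \frac{1}{m}\sum_{i=1}^{m}\bar{y}_i,$$

which is unbiased and has variance

$$\mathrm{Var}(\tilde{y}) = \mathrm{Var}(\tilde{y}_T)/N^2.$$

We can obtain unbiased estimators of Var($\tilde{y}_T$) and Var($\tilde{y}$) from the sample,
in the form

$$s^2(\tilde{y}_T) = \frac{N^2}{m(m-1)}\sum_{i=1}^{m}(\bar{y}_i - \tilde{y})^2,$$

and

$$s^2(\tilde{y}) = \frac{1}{m(m-1)}\sum_{i=1}^{m}(\bar{y}_i - \tilde{y})^2.$$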

The estimator $\tilde{y}$ above is again described as being based on probability
proportional to size (pps) estimation; it is a serious competitor to the three
described in Section 6.3. Since the sampling has taken place with replacement
we must expect the procedure to be somewhat wasteful of effort: reflected in
the value of Var($\tilde{y}$). But even so, circumstances can arise (in particular when
the $\bar{Y}_i$ and $N_i$ are more or less uncorrelated) in which $\tilde{y}$ has similar efficiency
to $\bar{y}_{c(a)}$, and both are more efficient than $\bar{y}_{c(b)}$ or $\bar{y}_{c(c)}$. $\tilde{y}_T$ has the added
advantages of unbiasedness and ease of calculation, although the sampling
procedure is more difficult and can be more costly (in view of the emphasis
on the larger clusters). A more detailed comparison of these estimators is given
by Cochran (1977, Chapter 9A).
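
The unbiasedness of the pps estimator (6.17), and the variance (6.18), can be verified exactly on a small population by summing over every ordered sample of m draws made with replacement. The sketch below assumes a hypothetical population of M = 5 unequal clusters (values invented for illustration) and m = 2; it also shows one way of drawing such a sample with the standard library, using `random.choices` with the cluster sizes as weights. All names are our own.

```python
import random
from itertools import product
from math import prod
from statistics import mean

# Hypothetical population of M = 5 clusters of unequal size.
clusters = [[4, 6], [5, 7, 9], [8, 10, 12, 14], [3, 5], [9, 11, 13]]
M = len(clusters)
sizes = [len(c) for c in clusters]
N = sum(sizes)
Y_T = sum(sum(c) for c in clusters)
Y_bar = Y_T / N
p = [N_i / N for N_i in sizes]              # selection probabilities N_i / N
cluster_means = [mean(c) for c in clusters]
m = 2

def pps_total(indices):
    """(6.17): N times the average of the means of the chosen clusters."""
    return N * mean([cluster_means[i] for i in indices])

# Exact expectation and variance over all ordered with-replacement samples.
expected = 0.0
second_moment = 0.0
for draw in product(range(M), repeat=m):
    prob = prod(p[i] for i in draw)
    estimate = pps_total(draw)
    expected += prob * estimate
    second_moment += prob * estimate ** 2
var_exact = second_moment - expected ** 2

var_618 = N / m * sum(N_i * (mu - Y_bar) ** 2
                      for N_i, mu in zip(sizes, cluster_means))

# Drawing one pps sample (with replacement) in practice.
rng = random.Random(1)
one_draw = rng.choices(range(M), weights=sizes, k=m)
print(expected, Y_T, var_exact, var_618, one_draw, pps_total(one_draw))
```

Because sampling is with replacement the same cluster can appear more than once in a draw; the estimator simply counts it as often as it is drawn.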

Sometimes this approach may need to be modified. If the cluster sizes N; 
are not known precisely, then it may be necessary to use an alternative measure 
of cluster ‘size’, perhaps an estimate of N; or some other quantity likely to be 
positively correlated with the N;. For example, suppose we wish to sample 
primary schools throughout the country by taking a one-stage cluster sample 
of local education authorities. If knowledge of the numbers of schools for 
each authority happened not to be readily available, certain other factors might 
well be: such as expenditure on school education in the regions covered by 
the authorities, or the total populations in these regions. Either of these will 
be highly correlated with the number of schools, and will constitute a reason- 
able basis for probability sampling. Such sampling is now referred to as 
sampling with probability proportional to estimated size (or ppes sampling). 

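So we now assume that each cluster has such a measure, $Z_i$, associated with
it, and use this as the basis for ppes sampling, which consists of choosing m
clusters, with replacement, where at each stage of drawing the sample, cluster
i has probability $p_i = Z_i/Z_T$ of being chosen ($i = 1, 2, \ldots, M$; $Z_T = \sum_{i=1}^{M} Z_i$).

$Y_T$ and $\bar{Y}$ are now estimated by

$$\tilde{y}_T = \frac{Z_T}{m}\sum_{i=1}^{m}\frac{y_{iT}}{z_i}, \qquad
\tilde{y} = \frac{Z_T}{mN}\sum_{i=1}^{m}\frac{y_{iT}}{z_i},$$

respectively, where $z_i$ is the measure of size of the ith chosen cluster. These
estimators are unbiased, and (by (2.11)) have variances

$$\frac{1}{m}\sum_{i=1}^{M}\frac{Z_i}{Z_T}\left(\frac{Z_T Y_{iT}}{Z_i} - Y_T\right)^2, \qquad
\frac{1}{m}\left\{\sum_{i=1}^{M}\frac{Z_T Y_{iT}^2}{N^2 Z_i} - \bar{Y}^2\right\},$$

respectively.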
In the light of what we saw of the effect of sampling with probability 
proportional to the cluster total, the most advantageous measure of size in 
ppes sampling will clearly be that which is most nearly proportional to the 
cluster total. 
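
The ppes procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `ppes_estimate_total` and `ppes_variance_total` are hypothetical, the estimator is taken to be the usual with-replacement form $(Z_T/m)\sum y_{iT}/Z_i$, and the toy figures are invented so that the size measure is exactly proportional to the cluster totals.

```python
import random


def ppes_estimate_total(cluster_totals, sizes, m, rng):
    """Draw m clusters with replacement, cluster i having probability
    Z_i / Z_T at each draw, and return (Z_T / m) * sum(y_iT / Z_i)."""
    z_total = sum(sizes)
    indices = list(range(len(sizes)))
    chosen = rng.choices(indices, weights=sizes, k=m)
    return (z_total / m) * sum(cluster_totals[i] / sizes[i] for i in chosen)


def ppes_variance_total(cluster_totals, sizes, m):
    """Variance of the estimated total: (1/m) * (sum(Z_T * Y_iT^2 / Z_i) - Y_T^2)."""
    z_total = sum(sizes)
    y_total = sum(cluster_totals)
    weighted = sum(z_total * y * y / z for y, z in zip(cluster_totals, sizes))
    return (weighted - y_total ** 2) / m


# Invented toy population: the size measure is exactly proportional
# to the cluster totals, the most favourable case described in the text.
totals = [10, 20, 30]
sizes = [1, 2, 3]
rng = random.Random(1)
print(ppes_estimate_total(totals, sizes, m=4, rng=rng))
print(ppes_variance_total(totals, sizes, m=4))
```

With a size measure exactly proportional to the cluster totals every possible sample gives the same estimate, so the variance vanishes; any departure from proportionality makes it positive.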




6.5 Multi-stage sampling 


There are many ways in which the cluster sampling method may be modified 
or extended to cope with the specific demands of more complicated situations. 

One example arises where the selected clusters in the primary cluster sample 
are themselves sampled, rather than fully inspected. Thus if we choose an s.r. 
sample of m of the M clusters (primary units) which comprise the population, 
we may then take s.r. samples of sizes $n_1, n_2, \ldots, n_m$ of the secondary units in 
the chosen clusters. Typically we have $n_i < N_i$ ($i = 1, 2, \ldots, m$), in contrast to 
the sampling scheme of the previous section where $n_i = N_i$. The process of 
drawing samples from the selected clusters is called sub-sampling: the resulting 
total sample of size $n = \sum_{i=1}^{m} n_i$ is called a two-stage cluster sample. 

If the secondary units are the individual members of the study population, 
there is no point in going further. If instead they consist of groups of population 
members, then we might either use all their constituent members, or proceed 
to further stages of sub-sampling. In this latter case we encounter multi-stage 
cluster samples, corresponding to progressively higher levels of sub-sampling. 

As described, the probability sampling mechanism at each stage is simple 
random sampling. 

Such multi-stage sampling can be illustrated for the school survey referred 
to in the opening pages of this chapter. In seeking a sample of school children, 
it could be particularly convenient to regard local education authorities as the 
primary units, schools under their control as the secondary units, and children 
in those schools as tertiary units. A three-stage cluster sample (an s.r. sample 
of education authorities, an s.r. sample of schools under each of the selected 
authorities, an s.r. sample of children in each of the selected schools) has the 
advantages of being relatively easy to obtain, and of appearing to ‘cover the 
population in a representative manner’. Additionally, without a complete listing 
of schoolchildren it would be a clumsy procedure to seek a direct s.r. sample 
from the whole population, whilst limited financial resources could imply that 
very few authorities (perhaps only one) would be chosen in a one-stage cluster 
sample, with the resulting risk of serious regional idiosyncrasies. 
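
The three-stage scheme just described can be sketched as follows. The nested-dictionary layout, the function name `three_stage_sample` and the toy population are all assumptions made for illustration; simple random sampling is used at every stage, as in the text.

```python
import random


def three_stage_sample(population, n_auth, n_schools, n_children, rng):
    """population maps authority -> school -> list of children.
    Take an s.r. sample at each of the three stages."""
    sample = []
    for auth in rng.sample(sorted(population), n_auth):
        schools = population[auth]
        k_schools = min(n_schools, len(schools))
        for school in rng.sample(sorted(schools), k_schools):
            children = schools[school]
            k_children = min(n_children, len(children))
            for child in rng.sample(children, k_children):
                sample.append((auth, school, child))
    return sample


# Invented toy population: 6 authorities, 4 schools each, 10 children each.
population = {
    f"A{a}": {f"A{a}-S{s}": [f"A{a}-S{s}-C{c}" for c in range(10)]
              for s in range(4)}
    for a in range(6)
}

chosen = three_stage_sample(population, n_auth=3, n_schools=2,
                            n_children=5, rng=random.Random(0))
print(len(chosen))                          # 3 * 2 * 5 children
print(sorted({auth for auth, _, _ in chosen}))
```

Only a list of authorities, a list of schools for each chosen authority, and a list of children for each chosen school are needed; no complete listing of schoolchildren is ever consulted.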

Thus the prime stimulus for multi-stage sampling is again administrative 
convenience, although scope exists for taking into account variance and cost 
considerations in the specification of what sizes of sample to take at the 
different stages. 

Only a simple illustration of multi-stage sampling will be considered here; 
discussion of more complicated structures and more sophisticated probability 
sampling schemes can be found in the various texts on sampling theory which 
appear in the Bibliography (particularly Cochran (1977); Hansen, Hurwitz 
and Madow (1953); and Som (1973)). 


6.6 Two-stage cluster sampling 


Suppose we have a population consisting of M clusters, each of similar size 
L, and we draw a two-stage cluster sample by taking l members at random 
from each of an s.r. sample of m clusters. We assume that the cluster elements 
are the individual members of the population, and we wish to estimate the 
mean value $\bar{Y}$ of some measure Y defined on these members. The assumption 
of equal-sized primary units and equal-sized sub-samples at this stage simplifies 
the discussion. 

The total sample size is n = ml; the sample members are $y_{ij}$ ($i = 1, 2, \ldots, m$; 
$j = 1, 2, \ldots, l$); the within-cluster sample means are denoted $\bar{y}_i$ ($i = 1, 2, \ldots, m$). 
We denote by $Y_{ij}$ ($i = 1, 2, \ldots, M$; $j = 1, 2, \ldots, L$), $\bar{Y}_i$ ($i = 1, 2, \ldots, M$), and $\bar{Y}$ 
respectively, the population members, the cluster means, and the overall population 
mean. 


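The simplest estimator of $\bar{Y}$ is the analogue of the one-stage cluster sample 
mean $\bar{y}_{cl}$ (see (6.1)), namely 

$$\bar{y}'_{cl} = \frac{1}{m}\sum_{i=1}^{m}\bar{y}_i. \qquad (6.19)$$

It is readily confirmed that $\bar{y}'_{cl}$ is unbiased for $\bar{Y}$, and that it has variance 

$$\mathrm{Var}(\bar{y}'_{cl}) = \frac{M-m}{mM}\sum_{i=1}^{M}\frac{(\bar{Y}_i-\bar{Y})^2}{M-1} + \frac{L-l}{mLl}\sum_{i=1}^{M}\sum_{j=1}^{L}\frac{(Y_{ij}-\bar{Y}_i)^2}{M(L-1)} \qquad (6.20)$$

$$= \mathrm{Var}(\bar{y}_{cl}) + \frac{L-l}{mLl}\,\bar{S}^2, \qquad (6.21)$$

where $\bar{S}^2$ is again the average within-cluster variance. 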

As required, Var($\bar{y}'_{cl}$) reduces to Var($\bar{y}_{cl}$) if l = L, that is for one-stage cluster 
sampling (complete inspection of each selected cluster). When l < L, the 
variance is increased by the amount $(L-l)\bar{S}^2/(mLl)$, a contribution which arises 
from the further sampling variation due to sub-sampling the selected clusters. 

But care must be exercised in interpreting (6.21)! Whilst it literally declares 
that $\bar{y}'_{cl}$ has larger variance than $\bar{y}_{cl}$, this latter quantity requires full inspection 
for all m selected clusters so that the sample size is mL. But $\bar{y}'_{cl}$ is based on 
a sample of size ml which is typically less than mL. The fact that a smaller 
sample yields an estimator with larger variance is no surprise; in itself it does 
not reflect on the relative efficiency of estimation of $\bar{Y}$ in one- and two-stage 
cluster sampling. 

The proofs of the unbiasedness of $\bar{y}'_{cl}$, and of the form (6.20) for its variance, 
are quite straightforward using conditional expectation arguments: taking 
within-cluster expectations conditional on the selected clusters followed by 
the marginal expectation with respect to the s.r. choice of clusters at the first 
stage. The details will not be presented. 
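
Although the proofs are omitted, both properties can be checked numerically on a small population by listing every possible two-stage sample. The sketch below assumes the variance takes the usual two-component form (a between-cluster term with factor (M-m)/(mM) and a within-cluster term with factor (L-l)/(mLl)); the function names and the nine-member population are invented for illustration.

```python
from itertools import combinations, product
from statistics import mean, pvariance


def variance_6_20(pop, m, l):
    """Two-stage variance for M equal-sized clusters (pop is a list of lists)."""
    M, L = len(pop), len(pop[0])
    cluster_means = [mean(c) for c in pop]
    grand_mean = mean(cluster_means)
    # Variance of the cluster means, divisor M - 1.
    s1_sq = sum((cm - grand_mean) ** 2 for cm in cluster_means) / (M - 1)
    # Average within-cluster variance, divisor M(L - 1).
    sbar_sq = sum((y - cm) ** 2
                  for c, cm in zip(pop, cluster_means) for y in c) / (M * (L - 1))
    return (M - m) / (m * M) * s1_sq + (L - l) / (m * L * l) * sbar_sq


def all_two_stage_means(pop, m, l):
    """Mean of the sub-sample means for every equally likely two-stage sample."""
    results = []
    for clusters in combinations(range(len(pop)), m):
        subsamples = [list(combinations(pop[i], l)) for i in clusters]
        for choice in product(*subsamples):
            results.append(mean(mean(s) for s in choice))
    return results


pop = [[1, 2, 6], [3, 5, 7], [4, 8, 12]]      # M = 3 clusters of size L = 3
means = all_two_stage_means(pop, m=2, l=2)

print(len(means))                              # number of possible samples
print(mean(means), mean(y for c in pop for y in c))
print(pvariance(means), variance_6_20(pop, 2, 2))
```

Because the clusters are of equal size and equal sub-samples are taken, all the listed samples are equally likely, so the plain average and population variance of `means` are the exact expectation and variance of the estimator.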

The population total, $Y_T$, can be estimated by $N\bar{y}'_{cl}$. It is unbiased and has 
variance $N^2\,\mathrm{Var}(\bar{y}'_{cl})$. 

In practice we will need to estimate Var($\bar{y}'_{cl}$). We can obtain an unbiased 
estimator as 

$$s^2(\bar{y}'_{cl}) = \frac{M-m}{Mm}\sum_{i=1}^{m}\frac{(\bar{y}_i-\bar{y}'_{cl})^2}{m-1} + \frac{L-l}{MLl}\sum_{i=1}^{m}\sum_{j=1}^{l}\frac{(y_{ij}-\bar{y}_i)^2}{m(l-1)}. \qquad (6.22)$$

At first sight, the second term in (6.22) contrasts strangely with the second 
term in (6.20): the divisor m in (6.20) has been replaced by M. The reason 
for this is that with incomplete inspection of the selected clusters, 

$$\sum_{i=1}^{m}\frac{(\bar{y}_i-\bar{y}'_{cl})^2}{m-1}$$

is not an unbiased estimator of 

$$\sum_{i=1}^{M}\frac{(\bar{Y}_i-\bar{Y})^2}{M-1},$$

and a compensation is needed which changes the divisor in the second term 
of (6.22) in the way described. 
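
The effect of the changed divisor can be seen by averaging the estimated variance over every possible sample from a small population. The following sketch assumes the estimator has a between-sample-means term with factor (M-m)/(Mm) and a pooled within-sample term with factor (L-l)/(MLl); the function names and the toy population are hypothetical.

```python
from itertools import combinations, product
from statistics import mean, pvariance, variance


def s2_two_stage(sample, M, L):
    """Estimated variance of the two-stage mean from m sub-samples of size l."""
    m, l = len(sample), len(sample[0])
    sub_means = [mean(s) for s in sample]
    between = variance(sub_means)                    # divisor m - 1
    within = sum(variance(s) for s in sample) / m    # overall divisor m(l - 1)
    return (M - m) / (M * m) * between + (L - l) / (M * L * l) * within


def all_two_stage_samples(pop, m, l):
    """Yield every equally likely two-stage sample as a tuple of sub-samples."""
    for clusters in combinations(range(len(pop)), m):
        subsamples = [list(combinations(pop[i], l)) for i in clusters]
        for choice in product(*subsamples):
            yield choice


pop = [[1, 2, 6], [3, 5, 7], [4, 8, 12]]             # M = 3, L = 3
M, L = len(pop), len(pop[0])
samples = list(all_two_stage_samples(pop, m=2, l=2))

true_var = pvariance([mean(mean(s) for s in smp) for smp in samples])
avg_estimate = mean(s2_two_stage(smp, M, L) for smp in samples)
print(true_var, avg_estimate)
```

The average of the estimates over all samples agrees with the true variance of the estimator, which is what unbiasedness requires. Replacing M by m in the second factor would overstate the variance on average, since the between-sample-means term already absorbs part of the sub-sampling variation.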

One case where this estimated variance (6.22) is particularly easy to obtain 
is where only a small proportion of the clusters is sampled. This is often the 
case in complex populations. Then the second term becomes negligible and 
we can estimate Var($\bar{y}'_{cl}$) simply from the values of the sample means of the 
selected clusters, as 

$$\frac{M-m}{Mm}\sum_{i=1}^{m}\frac{(\bar{y}_i-\bar{y}'_{cl})^2}{m-1}.$$

Whilst the decision to take a two-stage cluster sample is usually dictated by 
administrative considerations, some choice can often be exercised in the values 
of m and l. Within any limitations imposed by the total sample size, or total 
cost of sampling, different values of m and l will yield estimators with different 
variances, and the question of optimum choice arises. If substantial sampling 
costs occur only at the sub-sampling stage, and are the same in all clusters, 
then optimum choice of m and l for fixed total cost amounts to choosing m 
and l to minimise Var($\bar{y}'_{cl}$) for a prescribed total sample size ml. But such a 
cost structure is not often realistic. 

When primary sampling and sub-sampling both involve costs, a more complicated 
cost model is needed. The simplest way in which such differential costs 
may be included is through a model which declares that there is a basic 
overhead cost $d_0$, that each selected primary unit costs an additional amount 
$d_1$, and each secondary unit an increment $d_2$, so that the total cost of sampling 
is 

$$C = d_0 + d_1 m + d_2 ml. \qquad (6.23)$$

Optimum choice of m and l now requires Var($\bar{y}'_{cl}$) to be minimised subject 
to the constraint (6.23) in which C is the total amount of money available for 
sampling. As in our study of stratified sampling, the dual problem of minimising 
the total cost for a prescribed Var($\bar{y}'_{cl}$) does not require separate study. The 
optimum value of l is the same in both situations, being approximately 

$$\sqrt{d_1/d_2}\;\bar{S}\big/\left(S_1^2-\bar{S}^2/L\right)^{1/2},$$

where $S_1^2 = \sum_{i=1}^{M}(\bar{Y}_i-\bar{Y})^2/(M-1)$. The corresponding value of m may be 
obtained, as appropriate, from the prescribed value of C or Var($\bar{y}'_{cl}$). 
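
A small sketch shows how these quantities might be computed for a prescribed budget. The function names, the simple rounding of l to the nearest whole number and the cost and variance figures are all assumptions made for illustration; in practice the variance components would themselves be estimates.

```python
import math


def optimum_subsample_size(d1, d2, s1_sq, sbar_sq, L):
    """Approximate optimum l = sqrt(d1/d2) * Sbar / sqrt(S1^2 - Sbar^2 / L),
    limited to the range 1..L. s1_sq is the variance of the cluster means,
    sbar_sq the average within-cluster variance."""
    gap = s1_sq - sbar_sq / L
    if gap <= 0:
        # Cluster means vary no more than random grouping would produce:
        # inspect each chosen cluster completely.
        return L
    l_star = math.sqrt(d1 / d2) * math.sqrt(sbar_sq / gap)
    return max(1, min(L, round(l_star)))


def clusters_affordable(C, d0, d1, d2, l):
    """Largest m with d0 + d1*m + d2*m*l <= C, from the cost model (6.23)."""
    return int((C - d0) // (d1 + d2 * l))


# Invented figures: each cluster costs 16 units to reach, each member 1 unit.
l_opt = optimum_subsample_size(d1=16, d2=1, s1_sq=1.3, sbar_sq=9.0, L=30)
m_opt = clusters_affordable(C=1000, d0=160, d1=16, d2=1, l=l_opt)
print(l_opt, m_opt, 160 + 16 * m_opt + 1 * m_opt * l_opt)
```

Raising the cost of reaching a cluster relative to the cost of inspecting a member pushes the optimum towards larger sub-samples from fewer clusters, as the square-root factor indicates.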

In practice, the slight departures from the optimum value of l which are 
bound to arise from the inevitable uncertainty of the values of $d_1$, $d_2$, $\bar{S}^2$, and 
$S_1^2$ are likely to lead to little loss of precision relative to the truly optimum 
choice. 

When $S_1^2 - \bar{S}^2/L$ is sufficiently small, optimum choice of l will be L, so that 
one-stage cluster sampling is indicated as the best policy. This makes sense 
intuitively, since the variance of the cluster means, $S_1^2$, is now much the same 
as $\bar{S}^2/L$, with the implication that elements are assigned to clusters more or 
less at random. If so, complete inspection of a few clusters should be as 
efficient as partial inspection of many for the same total size of sample, n = ml. 
At the same time it will, from (6.23), be less costly, since $d_0 + d_2 ml$ remains 
fixed whilst $d_1 m$ is minimised. 

A much more detailed discussion of the properties of estimators and of the 
optimal choice of primary, and sub-sample, sizes in multi-stage sampling is 
given by Cochran (1977, Chapters 10 and 11), for this present situation and 
also for progressively more complicated situations (with unequal cluster sizes, 
stratified populations, more sophisticated probability sampling schemes). 

We shall consider here just one extension: two-stage sampling with unequal 
cluster sizes. For illustration of this situation, where we relax the unrealistic 
assumption of equal-sized primary units, we shall consider only the simplest 
case of a single estimator of the population total $Y_T$ (or mean $\bar{Y}$) using s.r. 
sampling. Of course more complicated estimators are feasible, using the distinctions 
drawn in Section 6.3. In practice we might also want to extend our 
coverage to proportions, to employment of some concomitant variable X 
observed along with Y, or to more complicated sampling schemes (e.g. with 
probability proportional to size). Estimating proportions poses no fundamentally 
new considerations, but the other complications take us beyond the 
scope of this present treatment. (Cochran (1977) and Som (1973) extend the 
coverage into some of these areas.) 

Thus suppose we again have a population of M clusters, but now of possibly 
differing sizes $L_1, L_2, \ldots, L_M$. We draw a two-stage cluster sample by choosing 
m clusters at random and then drawing s.r. samples of sizes $l_i$ ($i = 1, 2, \ldots, m$) 
from the chosen clusters. 

Let the second-stage sample means and variances be $\bar{y}_i$ and $s_i^2$ 
($i = 1, 2, \ldots, m$); and the corresponding population characteristics $\bar{Y}_i$ and $S_i^2$ 
($i = 1, 2, \ldots, M$). The overall sample consists of $n = \sum_{i=1}^{m} l_i$ observations: $y_{ij}$ 
($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, l_i$). The overall population has mean $\bar{Y}$, total $Y_T$, 
variance $S^2$ and size $N = \sum_{i=1}^{M} L_i$. 

A commonly employed estimator of $Y_T$ is 

$$\hat{Y}_T = \frac{M}{m}\sum_{i=1}^{m} L_i\bar{y}_i, \qquad (6.24)$$

which is intuitively appealing: each $L_i\bar{y}_i$ estimates the corresponding cluster 
(first-stage unit) total $Y_{iT}$; this is summed over the chosen clusters (first-stage 
units) and further enhanced by the factor M/m to provide an estimate of the 
overall total $Y_T$. It is easily confirmed that $\hat{Y}_T$ is unbiased. 

An unbiased estimator of the variance of $\hat{Y}_T$ is given by 

$$s^2(\hat{Y}_T) = \frac{M(M-m)}{m(m-1)}\sum_{i=1}^{m}\left(L_i\bar{y}_i-\hat{Y}_T/M\right)^2 + \frac{M}{m}\sum_{i=1}^{m} L_i(L_i-l_i)\,s_i^2/l_i. \qquad (6.25)$$

Note that if we denote by $f_1$ and $f_{2i}$ the first-stage and second-stage sampling 
fractions, m/M and $l_i/L_i$, the estimated variance (6.25) can be rewritten as 

$$s^2(\hat{Y}_T) = \frac{N^2(1-f_1)}{m\bar{L}^2}\sum_{i=1}^{m}\frac{\left(L_i\bar{y}_i-\hat{Y}_T/M\right)^2}{m-1} + \frac{N^2 f_1}{m^2\bar{L}^2}\sum_{i=1}^{m} L_i^2(1-f_{2i})\,s_i^2/l_i,$$

where $\bar{L} = \sum_{i=1}^{M} L_i/M$. 

Corresponding estimates for the population mean $\bar{Y}$ are readily obtained 
by using $\hat{Y}_T/N$, with estimated variance $s^2(\hat{Y}_T)/N^2$. 

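An alternative estimator of $\bar{Y}$ is 

$$\bar{y}_R = \sum_{i=1}^{m} L_i\bar{y}_i \Big/ \sum_{i=1}^{m} L_i, \qquad (6.26)$$

which takes the form of a ratio estimate (see Chapter 4). It will be biased but 
this should not be serious if m (the number of sampled first-stage units) is 
large. A sample estimate of the MSE of $\bar{y}_R$ is given by 

$$\mathrm{mse}(\bar{y}_R) = \frac{1-f_1}{m\bar{L}^2}\sum_{i=1}^{m}\frac{L_i^2(\bar{y}_i-\bar{y}_R)^2}{m-1} + \frac{f_1}{m^2\bar{L}^2}\sum_{i=1}^{m} L_i^2(1-f_{2i})\,s_i^2/l_i,$$

which is basically similar in form to (6.25). 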

Yet another approach is to use sampling with probability proportional to size 
for the first-stage selection. A natural pps principle would be to sample with 
probability proportional to the size of the clusters (i.e. proportional to $L_i$) if 
this is feasible. See Cochran (1977, Chapter 11) for more detail and for 
discussion of the relative efficiencies of the different methods. 


Example 6.2 


A videotape hire company has shops in each of 5 regions: three 
regions have 12 shops, the others just 8. To estimate the total number 
$Y_T$ of videofilms hired from the company in a particular week, the 
sales manager phones 12 shops chosen by picking 3 regions at 
random and then making a random choice of 3, 5 and 4 shops from 
the chosen regions. 

The results were as follows: 

    first region:   260, 296, 182             (12 shops) 
    second region:  156, 261, 130, 302, 241   (8 shops) 
    third region:   196, 356, 268, 284        (12 shops) 

The two estimates of $Y_T$ are given by $\hat{Y}_T$ (6.24) and by $N\bar{y}_R$. We have 

    M = 5    L_i = 12, 12, 12, 8, 8    N = 52 
    m = 3    l_i = 3, 5, 4 

    s_1^2 = 3396    s_2^2 = 5256    s_3^2 = 4309 

So we obtain as estimates of $Y_T$ 

    \hat{Y}_T = 13347   and   N\bar{y}_R = 13013 

with respective estimated standard error and root mean square error 
of 1640 and 810. 
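
The arithmetic of this example is easily reproduced. The sketch below recomputes the two point estimates and the sub-sample variances from the data; for the standard error of the first estimate it assumes the usual two-stage form, with first-stage factor M(M-m)/(m(m-1)) and second-stage factor M/m, as the intended reading of (6.25). The variable names are illustrative only.

```python
from math import sqrt
from statistics import mean, variance

M, N = 5, 52                       # regions and shops in the population
sizes = [12, 8, 12]                # L_i for the three sampled regions
data = [
    [260, 296, 182],
    [156, 261, 130, 302, 241],
    [196, 356, 268, 284],
]
m = len(data)

means = [mean(d) for d in data]                 # sub-sample means
s_sq = [variance(d) for d in data]              # sub-sample variances
cluster_totals = [L * yb for L, yb in zip(sizes, means)]

# Unbiased estimator of the total: (M/m) * sum(L_i * ybar_i).
y_total_hat = M / m * sum(cluster_totals)

# Ratio-type estimator of the mean, scaled up by N.
ratio_total = N * sum(cluster_totals) / sum(sizes)

# Estimated variance of y_total_hat (assumed form of (6.25)).
first = M * (M - m) / (m * (m - 1)) * sum(
    (t - y_total_hat / M) ** 2 for t in cluster_totals)
second = M / m * sum(
    L * (L - len(d)) * s2 / len(d) for L, d, s2 in zip(sizes, data, s_sq))
se = sqrt(first + second)

print([round(v, 1) for v in s_sq])
print(round(y_total_hat), round(ratio_total), round(se))
```

The point estimates and the three variances agree with the figures quoted. The standard error computed this way comes out a little over 1630, close to (though not identical with) the quoted 1640; the small difference may reflect rounding in the original working.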


6.7 Comment 


We have observed that the principal reasons for using cluster sampling techniques 
are administrative, rather than statistical, ones. The lack of a complete 
listing of population members, and differing problems of access to different 
groups of population members, may make it laborious, unfeasible, or costly 
to seek an overall s.r. sample from the whole population. Very often the 
population has a hierarchical structure with successive levels containing smaller 
groups of population members, and access to the different levels is facilitated 
by the existence of lists of units at each level. The school survey was an 
illustration of such a structure. The most straightforward method of sampling, 
from the practical viewpoint, is likely to consist of drawing samples at each 
level: a sample of primary units, sub-samples of secondary units, and so on. 
In addition to such relative sampling ease, multi-stage cluster sampling can 
also have the intuitive appeal of seeming to provide a more representative 
coverage of the highly structured population. Furthermore, the clusters may 
be of interest in their own right. 

Statistical considerations enter with respect to the choice of what probability 
sampling scheme should be used, and what sizes of sample should be chosen, 
at the different stages. This choice will depend on the prevailing costs of 
sampling, but once we proceed beyond simple situations (such as described 
in Section 6.6), the appropriate analysis can be highly complex and even 
intangible, due to inadequate knowledge of the relevant cost and population 
variability factors. 

Even when a complete list of basic population members does exist, or can 
be compiled, it is often economically undesirable to take an overall s.r. sample 
from the whole population, and cluster sampling or multi-stage sampling is 
to be preferred. For example, the transport and administrative costs of drawing 
a s.r. sample of schools throughout Scotland (say) could be enormous. We 
could more economically select agents in 5 local education authorities, who 
can speedily and cheaply collect the required information on schools in their 
regions, and the savings in cost for such cluster sampling or multi-stage 
sampling will be large. It is unlikely that any loss of efficiency of estimation 
(for similar sizes of sample) could outweigh the economic considerations. 
Indeed the efficiency loss could be easily remedied by taking a larger cluster 
or multi-stage sample whilst still possibly retaining a substantial cost saving. 


6.8 Exercises 


6.1 A professional association publishes a list of its members. In a survey 
to estimate the average salary of members of the profession, this list is to be 
used as the basis for selecting a sample of about 5% of the 2640 members of 
the profession. Discuss difficulties and the possible advantages and disadvantages 
of employing simple random sampling or systematic sampling in the 
three cases: 

(a) where the list is in alphabetical order over all members, 

(b) where the list is in order of length of membership of the professional 
association, 

(c) where the list is grouped into different sections (e.g. employment sector, 
or nationality) and is in alphabetical order within each section. 


6.2 Show that in a population consisting of M equal sized clusters, each 
of size L, the intraclass correlation coefficient, $\rho$, defined by (6.6), can be 
expressed in terms of the population variance, $S^2$, and average within-cluster 
variance, $\bar{S}^2$, as 


Seen 


Confirm the result (6.7) for the variance of the one-stage cluster sample mean, 
$\bar{y}_{cl}$. Discuss the implied restriction on the range of possible values of $\rho$, and 
show that 


Var (Vor). wt 


6.3 A company, which provides its salesmen with cars for company business 
only, wishes to estimate the average number of miles covered by each car last 
year. The company operates from 12 branches, and the numbers of cars, $N_i$, 
and means and variances ($\bar{Y}_i$ and $S_i^2$) of miles driven last year (in thousands 
of miles) for each branch, are as follows. 

    Branch   $N_i$   $\bar{Y}_i$   $S_i^2$ 
       1       6      24.32        5.07 
       2       2      27.06        5.53 
       3      11      27.60        6.24 
       4       7      28.01        6.59 
       5       8      27.56        6.21 
       6      14      29.07        6.12 
       7       6      32.03        5.97 
       8       2      28.41        6.01 
       9       2      28.91        5.74 
      10       5      25.55        6.78 
      11      12      28.58        5.87 
      12       6      27.27        5.38 

Suppose that the average mileage is to be estimated by sampling a few 
branches at random, and using the figures for all the cars at the chosen branches. 
Work out the standard errors of the unbiased cluster sample estimator $\bar{y}_{c(2)}$ 
for a cluster sample of 4 branches, and of the s.r. sample mean for an s.r. 
sample of 27 cars (this being approximately the average number of cars which 
would be obtained in a cluster sample of 4 branches). Compare the efficiencies 
of the two methods. 

What can be said of the use of $\bar{y}_{c(1)}$, or of $\bar{y}_{c(3)}$, in this situation? 


7 Postscript 


Throughout the earlier chapters, the Statistics Class data (presented in Figure 
1.1 of Section 1.6) has been used time and again to illustrate the properties 
of estimators for the different sampling methods that have been described. 
Depending on the structure of the population being investigated, on the 
existence of concomitant information on auxiliary variables or on prevailing 
cost information or administrative considerations, the different estimators were 
expected on theoretical grounds to have different relative advantages. These 
were conveniently illustrated by constructing approximate sampling distributions 
for large numbers of samples drawn from the Statistics Class population. 
We were able to see in practice how circumstances affected the properties 
of estimators derived from overall s.r. sampling, from ratio or regression 
estimates, or from stratified, or cluster, sampling. 

As an aide-memoire to many of the results in this book, Figure 7.1 provides 
a summary of the Statistics Class results by presenting together all the derived 
sampling distributions which have earlier been shown individually in their 
respective contexts. The reader can obtain a useful check on his understanding 
of the broad factors that support the use of different estimators of the population 
mean $\bar{Y}$, by attempting to explain the reasons why the different sampling 
distributions shown in Figure 7.1 appear as they do in relation to one another. 

Figure 7.1. Summary of sampling distributions of estimators of $\bar{Y}$ in the Statistics Class 

Appendix 


Table 1 Random digits 


51 74 23 99 67 61 32 28 69 84 94 62 67 86 24 98 33 41 19 95 
63 38 06 86 54 99 00 65 26 94 02 82 90 23 07 79 62 67 80 60 
35 30 58 21 46 06 72 17 10 94 25 21 31 75 96 49 28 24 00 49 
63 43 36 82 69 65 51 18 37 88 61 38 44 12 45 32 92 85 88 65 
98 25 37 55 26 01 91 82 81 46 74 71 12 94 97 24 02 71 37 07 
02 63 21 17 69 71 50 80 89 56 38 15 70 11 48 43 40 45 86 98 
64 55 22 21 82 48 22 28 06 00 61 54 13 43 91 82 78 12 23 29 
85 07 26 13 89 01 10 07 82 04 59 63 69 36 03 69 11 15 83 80 
58 54 16 24 15 51 54 44 82 00 62 61 65 04 69 38 18 65 18 97 
34 85 27 84 87 61 48 64 56 26 90 18 48 13 26 37 70 15 42 57 
03 92 18 27 46 57 99 16 96 56 30 33 72 85 22 84 64 38 56 98 
62 95 30 27 59 37 75 41 66 48 86 97 80 61 45 23 53 04 01 63 
08 45 93 15 22 60 21 75 46 91 98 77 27 85 42 28 88 61 08 84 
07 08 55 18 40 45 44 75 13 90 24 94 96 61 02 57 55 66 83 15 
01 85 89 95 66 51 10 19 34 88 15 84 97 19 75 12 76 39 43 78 
72 84 71 14 35 19 11 58 49 26 50 11 17 17 76 86 31 57 20 18 
88 78 28 16 84 13 52 53 94 53 75 45 69 30 96 73 89 65 70 31 
45 17 75 65 57 28 40 19 72 12 25 12 74 75 67 60 40 60 81 19 
96 76 28 12 54 22 01 11 94 25 71 96 16 16 88 68 64 36 74 45 
43 31 67 72 30 24 02 94 08 63 38 32 36 66 02 69 36 38 25 39 
50 44 66 44 21 66 06 58 05 62 68 15 54 35 02 42 35 48 96 32 
22 66 22 15 86 26 63 75 41 99 58 42 36 72 24 58 37 52 18 51 
96 24 40 14 51 23 22 30 88 57 95 67 47 29 83 94 69 40 06 07 
31 73 91 61 19 60 20 72 93 48 98 57 07 23 69 65 95 39 69 58 
78 60 73 99 84 43 89 94 36 45 56 69 47 07 41 90 22 91 07 12 
84 37 90 61 56 70 10 23 98 05 85 11 34 76 60 76 48 45 34 60 
36 67 10 08 23 98 93 35 08 86 99 29 76 29 81 33 34 91 58 93 
07 28 59 07 48 89 64 58 89 75 83 85 62 27 89 30 14 78 56 27 
10 15 83 87 60 79 24 31 66 56 21 48 24 06 93 91 98 94 05 49 
55 19 68 97 65 03 73 52 16 56 00 53 55 90 27 33 42 29 38 87 
53 81 29 13 39 35 01 20 71 34 62 33 74 82 14 53 73 19 09 03 
51 86 32 68 92 33 98 74 66 99 40 14 71 94 58 45 94 19 38 81 
35 91 70 29 13 80 03 54 07 27 96 94 78 32 66 50 95 52 74 33 
37 71 67 95 13 20 02 44 95 94 64 85 04 05 72 01 32 90 76 14 
93 66 13 83 27 92 79 64 64 72 28 54 96 53 84 48 14 52 98 94 
02 96 08 45 65 13 05 00 41 84 93 07 54 72 59 21 45 57 09 77 
49 83 43 48 35 82 88 33 69 96 72 36 04 19 76 47 45 15 18 60 
84 60 71 62 46 40 80 81 30 37 34 39 23 05 38 25 15 35 71 30 
18 17 30 88 71 44 91 14 88 47 89 23 30 63 15 56 34 20 47 89 
79 69 10 61 78 71 32 76 95 62 87 00 22 58 40 92 54 01 75 25 


This table is taken from Fisher and Yates: Statistical Tables for Biological, Agricultural and 
Medical Research, published by Longman Group Ltd., London (previously published by Oliver 
& Boyd, Edinburgh), and by permission of the authors and publishers. 

Table 2 Double-tailed percentage points, $t_\nu(\alpha)$ and $z_\alpha$, for Student's t-distribution, and the 
Normal distribution. 

t-Distribution 

    Degrees of                          $\alpha$ 
    freedom $\nu$   0.10    0.05    0.02    0.01    0.002   0.001 
         1          6.314   12.71   31.82   63.66   318.3   636.6 
         2          2.920   4.303   6.965   9.925   22.33   31.60 
         3          2.353   3.182   4.541   5.841   10.22   12.94 
         4          2.132   2.776   3.747   4.604   7.173   8.610 
         5          2.015   2.571   3.365   4.032   5.893   6.859 
         6          1.943   2.447   3.143   3.707   5.208   5.959 
         7          1.895   2.365   2.998   3.499   4.785   5.405 
         8          1.860   2.306   2.896   3.355   4.501   5.041 
         9          1.833   2.262   2.821   3.250   4.297   4.781 
        10          1.812   2.228   2.764   3.169   4.144   4.587 
        12          1.782   2.179   2.681   3.055   3.930   4.318 
        14          1.761   2.145   2.624   2.977   3.787   4.140 
        16          1.746   2.120   2.583   2.921   3.686   4.015 
        18          1.734   2.101   2.552   2.878   3.611   3.922 
        20          1.725   2.086   2.528   2.845   3.552   3.850 
        25          1.708   2.060   2.485   2.787   3.450   3.725 
        30          1.697   2.042   2.457   2.750   3.385   3.646 
        40          1.684   2.021   2.423   2.704   3.307   3.551 
        60          1.671   2.000   2.390   2.660   3.232   3.460 
        80          1.664   1.990   2.374   2.639   3.195   3.415 
    $\infty$        1.645   1.960   2.326   2.576   3.090   3.291 

Normal distribution 

    $\alpha$        0.10    0.05    0.02    0.01    0.002   0.001 
    $z_\alpha$      1.645   1.960   2.326   2.576   3.090   3.291 

Sample surveys are invaluable when assessing the many facets of the world we 
live in. This book presents the subject at an intermediate mathematical level and is 
suitable for any student following a course on survey sampling in addition to its 
obvious appeal to statisticians and other professionals. Major emphasis is placed on 
the construction and use of statistical principles and methods for collecting and 
analysing data. Special features include 

* emphasis on the technical difficulties of sampling 

* explanation of modern sampling methods backed up with numerous examples 

* particular attention given to preliminary 